Abstract

We use Markov risk measures to formulate a risk-averse version of the undiscounted total cost problem 
for a transient controlled Markov process. We derive risk-averse dynamic programming equations and we 
show that a randomized policy may be strictly better than deterministic policies, when risk measures are 
employed. We illustrate the results on an optimal stopping problem and an organ transplant problem. 



Keywords: Dynamic Risk Measures; Markov Risk Measures; Stochastic Shortest Path; Optimal Stopping; Randomized Policy

1 Introduction

The optimal control problem for transient Markov processes is a classical model in Operations Research
(see Veinott [44], Pliska [31], Bertsekas and Tsitsiklis [6], Hernández-Lerma and Lasserre [17], and the
references therein). The research is focused on the expected total undiscounted cost model, with increased
state and control space generality.

Our objective is to consider a risk-averse model. So far, risk-averse problems for transient Markov models
were based on the arrival probability criteria (see, e.g., Nie and Wu [23] and Ohtsubo [25]) and utility
functions (see Denardo and Rothblum [10] and Patek [29]). We plan to use the recent theory of dynamic risk
measures (see Scandolo [40], Ruszczyński and Shapiro [37, 39], Cheridito, Delbaen and Kupper [7], Artzner
et al. [3], Klöppel and Schweizer [20], Pflug and Römisch [30], and the references therein) to develop and
solve new risk-averse formulations of the stochastic optimal control problem for transient Markov models.
Specific examples of such models are stochastic shortest path problems (Bertsekas and Tsitsiklis [6]) and
optimal stopping problems (cf. Çinlar [9], Dynkin and Yushkevich [11, 12], Puterman [32]).

Some applications of stochastic shortest path problems concerned with expected performance criteria are
given in the survey paper by White [45] and the references therein. However, in many practical problems,
the expected values may not be appropriate to measure performance, because they implicitly assume that the
decision maker is risk-neutral. Below, we provide examples of such real-life problems which were previously
modeled as discrete-time Markov decision processes with expected value as the objective function.

Alagoz et al. [1] suggest a discounted, infinite horizon, and absorbing Markov decision process model
to find the optimal time of liver transplant for a risk-neutral patient under the assumption that the liver is
transferred from a living donor. However, referring to Chew and Ho [8], they state that the risk-neutrality
of the patient is not a realistic assumption. In that study, the patient can be in one of the states "transplant," 
"death" and intermediate states corresponding to increasing sickness. The decisions are either to wait or 
to transplant. The "death" and "transplant" states are absorbing states with zero reward. Therefore, the 
undiscounted version of the model reduces to a stochastic longest path problem. 

*Rutgers University, RUTCOR, Piscataway, NJ 08854, USA 

^Rutgers University, Department of Management Science and Information Systems, Piscataway, NJ 08854, USA 



A stochastic shortest path problem can be used to find the optimal replacement time of a system. Kurt 
and Kharoufeh [21] propose a discounted, infinite horizon Markov decision process model to solve a similar
problem for a system under Markovian deterioration and Markovian environment. They assume that the sys- 
tem returns to the "new" state after it is replaced at a given cost. The state space depends on the environment 
and deterioration levels of the system. The decisions are either to replace the system at a replacement cost 
or to maintain it at a maintenance cost. Furthermore, we can consider another control "do nothing," to leave 
the system in operation without any maintenance or replacement at zero cost. They state that their problem 
can also be equivalently formulated as a stochastic shortest path problem with some probability of making a 
transition from each state to a zero-cost absorbing state. However, managers are not risk-neutral in real life 
and this needs to be considered in such replacement problems (see Tapiero and Venezia [43]).

So and Thomas [42] employ a discrete time Markov decision process to model profitability of credit cards.
The objective is to find a policy which maximizes the expected total discounted profit of the creditor. The state 
space depends on the customer's riskiness and the credit limit bands. Additionally, there are absorbing states 
which represent account closure and different classes of default. The decisions are either to increase the credit 
limit or keep it unchanged. If zero reward is collected at some of the absorbing states (e.g. closed account), 
then the undiscounted version of the model reduces to a stochastic longest path problem. However, creditors 
are assumed to be risk-neutral in these expected-value models, which may not be a realistic assumption. 

Our theory of risk-averse control problems for transient models applies to these and many other models. 
Our results complement and extend the results of Ruszczyński [36], where infinite-horizon discounted models
were considered. 

In section 2 we quickly review some basic concepts of controlled Markov models. In section 3 we adapt
and extend our earlier theory of Markov risk measures. In section 4 we introduce and analyze the concept
of a multikernel, which is essential for our theory. Section 5 is devoted to the analysis of a finite horizon
model. The main model with infinite horizon and dynamic risk measures is analyzed and solved in sections 6
and 7. Section 8 compares randomized and deterministic policies. Finally, section 9 illustrates our results on
risk-averse versions of an optimal stopping problem of Karlin [19] and of the organ transplant problem of
Alagoz et al. [1].

2 Controlled Markov Processes 

We quickly review the main concepts of controlled Markov models and we introduce relevant notation (for
details, see [13, 16, 17]). Let $\mathcal{X}$ be a state space, and $\mathcal{U}$ a control space. We assume that
$\mathcal{X}$ and $\mathcal{U}$ are Polish spaces, equipped with their Borel $\sigma$-algebras. A control set is a
measurable multifunction $U : \mathcal{X} \rightrightarrows \mathcal{U}$; for each state $x \in \mathcal{X}$ the set
$U(x) \subset \mathcal{U}$ is a nonempty set of possible controls at $x$. A controlled transition kernel $Q$ is a
measurable mapping from the graph of $U$ to the set $\mathcal{P}(\mathcal{X})$ of probability measures on
$\mathcal{X}$ (equipped with the topology of weak convergence).

The cost of transition from $x$ to $y$, when control $u$ is applied, is represented by the function $c(x,u,y)$,
where $c : \mathcal{X} \times \mathcal{U} \times \mathcal{X} \to \mathbb{R}$. Only $u \in U(x)$ and those
$y \in \mathcal{X}$ to which transition is possible matter here, but it is convenient to consider the function
$c(\cdot,\cdot,\cdot)$ as defined on the product space.

A stationary controlled Markov process is defined by a state space $\mathcal{X}$, a control space $\mathcal{U}$,
a control set $U$, a controlled transition kernel $Q$, and a cost function $c$.

For $t = 1,2,\dots$ we define the space of state and control histories up to time $t$ as
$\mathcal{H}_t = \mathrm{graph}(U)^{t-1} \times \mathcal{X}$. Each history is a sequence
$h_t = (x_1,u_1,\dots,x_{t-1},u_{t-1},x_t) \in \mathcal{H}_t$.

We denote by $\mathcal{P}(\mathcal{U})$ the set of probability measures on the set $\mathcal{U}$. Likewise,
$\mathcal{P}(U(x))$ is the set of probability measures on $U(x)$. A randomized policy is a sequence of measurable
functions $\pi_t : \mathcal{H}_t \to \mathcal{P}(\mathcal{U})$, $t = 1,2,\dots$, such that
$\pi_t(h_t) \in \mathcal{P}(U(x_t))$ for all $h_t \in \mathcal{H}_t$. In words, the distribution of the control $u_t$
is supported on a subset of the set of feasible controls $U(x_t)$. A Markov policy is a sequence of measurable
functions $\pi_t : \mathcal{X} \to \mathcal{P}(\mathcal{U})$, $t = 1,2,\dots$, such that
$\pi_t(x) \in \mathcal{P}(U(x))$ for all $x \in \mathcal{X}$. The function $\pi_t(\cdot)$ is called the decision
rule at time $t$. A Markov policy is stationary if there exists a function
$\pi : \mathcal{X} \to \mathcal{P}(\mathcal{U})$ such that $\pi_t(x) = \pi(x)$, for all $t = 1,2,\dots$ and all
$x \in \mathcal{X}$. Such a policy and the corresponding decision rule are called deterministic, if for every
$x \in \mathcal{X}$ there exists $u(x) \in U(x)$ such that the measure $\pi(x)$ is supported on $\{u(x)\}$.

For a stationary decision rule $\pi$, we write $Q^{\pi}$ to denote the corresponding transition kernel.

We focus on transient Markov models. We assume that there exists some absorbing state $x_A \in \mathcal{X}$, such
that $Q(\{x_A\} \mid x_A,u) = 1$ and $c(x_A,u,x_A) = 0$ for all $u \in U(x_A)$. Thus, after the absorbing state is
reached, no further costs are incurred.¹ To analyze such Markov models, it is convenient to consider the effective
state space $\tilde{\mathcal{X}} = \mathcal{X} \setminus \{x_A\}$, and the effective controlled substochastic kernel
$\tilde{Q}$ whose arguments are restricted to $\tilde{\mathcal{X}}$ and whose values are nonnegative measures on
$\tilde{\mathcal{X}}$, so that $\tilde{Q}(B \mid x,u) = Q(B \mid x,u)$, for all Borel sets
$B \subset \tilde{\mathcal{X}}$, all $x \in \tilde{\mathcal{X}}$, and all $u \in U(x)$. Moreover, we assume that the
following Pliska condition [31] is satisfied: a weight function $w : \tilde{\mathcal{X}} \to [1,\infty)$ and a
constant $K$ exist, such that for every Markov decision rule



$\pi$ we have

$$\Big\| \sum_{j=1}^{\infty} \big(\tilde{Q}^{\pi}\big)^{j} \Big\|_{w} \le K. \tag{1}$$



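In the condition above, the norm $\|A\|_w$ of a substochastic kernel $A$ is defined as follows:

$$\|A\|_{w} = \sup_{x \in \tilde{\mathcal{X}}} \frac{1}{w(x)} \int_{\tilde{\mathcal{X}}} w(y)\, A(dy \mid x). \tag{2}$$

It is the standard operator norm in the space of measurable functions $v : \tilde{\mathcal{X}} \to \mathbb{R}$ for which

$$\|v\|_{w} = \sup_{x \in \tilde{\mathcal{X}}} \frac{|v(x)|}{w(x)} < \infty.$$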

Hernández-Lerma and Lasserre [17] extensively discuss the role of weighted norms in dynamic programming
models.
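For finite state spaces the weighted norm of a substochastic kernel is a weighted maximum row sum. The following minimal sketch illustrates this; the matrix `Q_eff`, the weight vectors, and the function name `weighted_norm` are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def weighted_norm(A: Sequence[Sequence[float]], w: Sequence[float]) -> float:
    """Weighted norm of a finite substochastic matrix A:
    the largest value over x of (1 / w[x]) * sum_y w[y] * A[x][y]."""
    n = len(w)
    return max(sum(w[y] * A[x][y] for y in range(n)) / w[x] for x in range(n))


# Hypothetical effective kernel on two non-absorbing states; the mass missing
# from each row is the probability of jumping to the absorbing state.
Q_eff = [[0.5, 0.25],
         [0.0, 0.5]]

norm_unit = weighted_norm(Q_eff, [1.0, 1.0])   # plain maximum row sum
norm_other = weighted_norm(Q_eff, [1.0, 2.0])  # same kernel, different weights
```

With unit weights the norm is the largest probability of remaining among the non-absorbing states; other weight functions rescale the rows and may give a different value.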

Our point of departure is the expected total cost problem, which is to find a policy
$\Pi = \{\pi_t\}_{t=1}^{\infty}$ so as to minimize the expected cost until absorption:

$$\min_{\Pi}\ \mathbb{E}\Big[ \sum_{t=1}^{\infty} c(x_t,u_t,x_{t+1}) \Big]. \tag{3}$$



Under standard assumptions, the problem has a solution in the form of a stationary Markov policy. Moreover,
it is sufficient to restrict the considerations to deterministic policies. The optimal policy can be found by 
solving appropriate dynamic programming equations. 

Our intention is to introduce risk aversion to problem (3), and to replace the expected value operator
by a dynamic risk measure. We shall show that the Pliska condition (1) is not sufficient in this case, and
that properties of risk measures must be taken into account when considering transient models. We shall
also show that in the risk-averse case randomized policies can be optimal, and that it is essential to consider
general transition costs $c(x_t,u_t,x_{t+1})$, which in problem (3) could easily be reduced to functions depending
only on $(x_t,u_t)$. We do not assume that the costs are nonnegative, and thus our approach applies also, among
others, to stochastic longest path problems and optimal stopping problems with positive rewards.

3 Markov Risk Measures 

Suppose $T$ is a fixed time horizon. Each policy $\Pi = \{\pi_1,\pi_2,\dots\}$ results in a cost sequence
$Z_t = c(x_{t-1},u_{t-1},x_t)$, $t = 2,\dots,T+1$. We define the spaces $\mathcal{Z}_t$ of
$\mathcal{F}_t$-measurable random variables on $\Omega$, $t = 1,\dots,T$. In this paper, we focus on the case when
$\mathcal{Z}_t = \mathcal{L}_p(\Omega,\mathcal{F}_t,P)$, for some $p \in [1,\infty]$.



¹The case of a larger class of absorbing states easily reduces to the case of one absorbing state.



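To evaluate risk of this sequence we use a dynamic time-consistent risk measure of the following form:

$$J_T(\Pi,x_1) = \rho_1\Big( c(x_1,u_1,x_2) + \rho_2\big( c(x_2,u_2,x_3) + \cdots + \rho_{T-1}\big( c(x_{T-1},u_{T-1},x_T) + \rho_T( c(x_T,u_T,x_{T+1}) ) \big) \cdots \big) \Big). \tag{4}$$

Here, $\rho_t : \mathcal{Z}_{t+1} \to \mathcal{Z}_t$, $t = 1,\dots,T$, are one-step conditional risk measures
satisfying the following conditions:

A1. $\rho_t(\alpha Z + (1-\alpha)W) \le \alpha\rho_t(Z) + (1-\alpha)\rho_t(W)$, $\forall\, \alpha \in (0,1)$, $Z,W \in \mathcal{Z}_{t+1}$;
A2. If $Z \le W$ then $\rho_t(Z) \le \rho_t(W)$, $\forall\, Z,W \in \mathcal{Z}_{t+1}$;
A3. $\rho_t(Z + W) = Z + \rho_t(W)$, $\forall\, Z \in \mathcal{Z}_t$, $W \in \mathcal{Z}_{t+1}$;
A4. $\rho_t(\beta Z) = \beta\rho_t(Z)$, $\forall\, Z \in \mathcal{Z}_{t+1}$, $\beta \ge 0$.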

Ruszczyński [36, sec. 3] derives the nested formulation (4) and conditions (A2) and (A3) from general
properties of monotonicity and time-consistency of dynamic measures of risk. Conditions (A1) and (A4) are
added to model the diversification effect and scale-invariance of the preferences, similarly to the axioms of
coherent measures of risk (see (B1)-(B4) below).

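It is convenient to introduce vector spaces
$\mathcal{Z}_{t,\theta} = \mathcal{Z}_t \times \mathcal{Z}_{t+1} \times \cdots \times \mathcal{Z}_\theta$, where
$1 \le t \le \theta \le T+1$, and the conditional risk measures
$\rho_{t,\theta} : \mathcal{Z}_{t,\theta} \to \mathcal{Z}_t$ defined as follows:

$$\rho_{t,\theta}(Z_t,\dots,Z_\theta) = Z_t + \rho_t\Big( Z_{t+1} + \rho_{t+1}\big( Z_{t+2} + \cdots + \rho_{\theta-1}(Z_\theta) \cdots \big) \Big). \tag{5}$$

The operations of addition and multiplication by a scalar are defined in $\mathcal{Z}_{t,\theta}$ in the usual way.
We can also define the partial order relation $\preceq$ in a natural way:

$$(Z_t,\dots,Z_\theta) \preceq (W_t,\dots,W_\theta) \iff Z_\tau \le W_\tau \ \text{a.s.}, \quad \tau = t,\dots,\theta.$$

Immediately from the definition we obtain the following properties of conditional measures of risk.

Lemma 3.1. If the one-step conditional risk measures $\rho_\tau$, $\tau = t,\dots,\theta-1$, satisfy conditions A1-A4, then

(i) $\rho_{t,\theta}(\alpha Z + (1-\alpha)W) \le \alpha\rho_{t,\theta}(Z) + (1-\alpha)\rho_{t,\theta}(W)$, $\forall\, \alpha \in (0,1)$, $Z,W \in \mathcal{Z}_{t,\theta}$;
(ii) If $Z \preceq W$ then $\rho_{t,\theta}(Z) \le \rho_{t,\theta}(W)$, $\forall\, Z,W \in \mathcal{Z}_{t,\theta}$;
(iii) $\rho_{t,\theta}(\beta Z) = \beta\rho_{t,\theta}(Z)$, $\forall\, Z \in \mathcal{Z}_{t,\theta}$, $\beta \ge 0$;
(iv) $\rho_{t,\theta}(Z_t,\dots,Z_{\theta-1},0) = \rho_{t,\theta-1}(Z_t,\dots,Z_{\theta-1})$.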

As indicated in [36], the fundamental difficulty of formulation (4) is that at time $t$ the value of $\rho_t(\cdot)$
is $\mathcal{F}_t$-measurable and is allowed to depend on the entire history $h_t$ of the process. In order to overcome
this difficulty, in [36, sec. 4] a new construction of a one-step conditional measure of risk was introduced.
Its arguments are functions on the state space $\mathcal{X}$, rather than on the probability space $\Omega$. This
entails an additional complication, because in a controlled Markov process the probability measure on the state space
is not fixed, but depends on decisions $u$. We adapt this construction to the case of controlled Markov models
with randomized policies. In this case, it is convenient to consider functions on the product space
$\mathcal{U} \times \mathcal{X}$ equipped with its product Borel $\sigma$-algebra $\mathcal{B}$.

Suppose the current state is $x$ and we use a randomized control $\lambda$. This control, together with the transition
kernel $Q$, defines a probability measure $\lambda \circ Q_x$ on the product space $\mathcal{U} \times \mathcal{X}$ as follows:

$$[\lambda \circ Q_x](B_u \times B_y) = \int_{B_u} Q(B_y \mid x,u)\, \lambda(du), \quad B_u \in \mathcal{B}(\mathcal{U}),\ B_y \in \mathcal{B}(\mathcal{X}). \tag{6}$$

The measure is extended to other sets in $\mathcal{B}$ in the usual way. In the case of countable state and control
spaces, $[\lambda \circ Q_x](u,y)$ is the probability that control $u$ will be used at $x$ and the next state will be $y$.
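In the countable case the product measure is just a table of joint probabilities of (control, next state) pairs. A minimal sketch, assuming finite spaces represented by dictionaries; the function name and the sample data are hypothetical:

```python
from typing import Dict, Hashable, Tuple

Control = Hashable
State = Hashable


def compose_control_with_kernel(
    lam: Dict[Control, float],
    Q_x: Dict[Control, Dict[State, float]],
) -> Dict[Tuple[Control, State], float]:
    """Joint probability of using control u at the current state and then
    moving to y: lam[u] * Q_x[u][y]."""
    return {(u, y): lam[u] * p for u in lam for y, p in Q_x[u].items()}


# Hypothetical data: a randomized control at a fixed state x and the
# transition probabilities Q(. | x, u) for each control u.
lam = {"wait": 0.25, "stop": 0.75}
Q_x = {"wait": {"s1": 0.5, "s2": 0.5}, "stop": {"end": 1.0}}
joint = compose_control_with_kernel(lam, Q_x)
```

Summing `joint` over the control component gives the distribution of the next state under the randomized control.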

The cost incurred at the current stage is given by the function $c_x$ on the product space
$\mathcal{U} \times \mathcal{X}$ defined as follows:

$$c_x(u,y) = c(x,u,y), \quad u \in \mathcal{U},\ y \in \mathcal{X}. \tag{7}$$

Let $\mathcal{V} = \mathcal{L}_p(\mathcal{U} \times \mathcal{X},\mathcal{B},P_0)$, where $p \in [1,\infty]$ and $P_0$
is some reference probability measure on $\mathcal{U} \times \mathcal{X}$. It is convenient to think of the dual space
$\mathcal{V}'$ as the space of signed measures $m$ on $(\mathcal{U} \times \mathcal{X},\mathcal{B})$, which
are absolutely continuous with respect to $P_0$, with densities (Radon-Nikodym derivatives) lying in the space
$\mathcal{L}_q(\mathcal{U} \times \mathcal{X},\mathcal{B},P_0)$, where $1/p + 1/q = 1$. In the case of finite state
and control spaces $P_0$ may be the uniform measure; in other cases $P_0$ should be chosen in such a way that the
measures $\lambda \circ Q_x$ are elements of $\mathcal{V}'$. The measure $P_0$ does not play any other role in our
considerations. We consider the set of probability measures in $\mathcal{V}'$:

$$\mathcal{M} = \{ m \in \mathcal{V}' : m(\mathcal{U} \times \mathcal{X}) = 1,\ m \ge 0 \}.$$

We also assume that the spaces $\mathcal{V}$ and $\mathcal{V}'$ are endowed with topologies that make them paired
topological vector spaces with the bilinear form

$$\langle \varphi, m \rangle = \int_{\mathcal{U} \times \mathcal{X}} \varphi(u,y)\, m(du \times dy), \quad \varphi \in \mathcal{V},\ m \in \mathcal{V}'. \tag{8}$$

The space $\mathcal{V}'$ (and thus $\mathcal{M}$) will be endowed with the weak* topology. For $p \in [1,\infty)$ we
may endow $\mathcal{V}$ with the strong (norm) topology, or with the weak topology. For $p = \infty$, the space
$\mathcal{V}$ will be endowed with its weak topology defined by the form (8), that is, the weak* topology on
$\mathcal{L}_\infty(\mathcal{U} \times \mathcal{X},\mathcal{B},P_0)$.

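Definition 3.1. A measurable function $\sigma : \mathcal{V} \times \mathcal{X} \times \mathcal{M} \to \mathbb{R}$ is
a risk transition mapping if for every $x \in \mathcal{X}$ and every $m \in \mathcal{M}$, the function
$\varphi \mapsto \sigma(\varphi,x,m)$ is a coherent measure of risk on $\mathcal{V}$.

Recall that $\sigma(\cdot)$ is a coherent measure of risk on $\mathcal{V}$ (we skip the other two arguments for brevity), if

B1. $\sigma(\alpha\varphi + (1-\alpha)\psi) \le \alpha\sigma(\varphi) + (1-\alpha)\sigma(\psi)$, $\forall\, \alpha \in (0,1)$, $\varphi,\psi \in \mathcal{V}$;

B2. If $\varphi \le \psi$ then $\sigma(\varphi) \le \sigma(\psi)$, $\forall\, \varphi,\psi \in \mathcal{V}$;

B3. $\sigma(a + \varphi) = a + \sigma(\varphi)$, $\forall\, \varphi \in \mathcal{V}$, $a \in \mathbb{R}$;

B4. $\sigma(\beta\varphi) = \beta\sigma(\varphi)$, $\forall\, \varphi \in \mathcal{V}$, $\beta \ge 0$.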

Example 3.1. Consider the first-order mean-semideviation risk measure analyzed by Ogryczak and Ruszczyński
[26, 27], and Ruszczyński and Shapiro [38, Example 4.2], [39, Example 6.1], but with the state and the
underlying probability measure as its arguments. We define

$$\sigma(\varphi,x,m) = \langle \varphi, m \rangle + \kappa(x) \big\langle (\varphi - \langle \varphi, m \rangle)_+, m \big\rangle, \tag{9}$$

with some measurable function $\kappa : \mathcal{X} \to [0,1]$. We can verify directly that conditions (B1)-(B4) are satisfied.
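On a finite space the mean-semideviation mapping is the mean plus a multiple of the expected excess over the mean. A minimal sketch, assuming the function and the measure are given as dictionaries over the same finite set of outcomes; the name `mean_semideviation` and the data are hypothetical:

```python
from typing import Dict, Hashable


def mean_semideviation(
    phi: Dict[Hashable, float],
    m: Dict[Hashable, float],
    kappa: float,
) -> float:
    """First-order mean-semideviation: mean of phi under m plus kappa times
    the expected positive part of (phi - mean), with kappa in [0, 1]."""
    mean = sum(phi[k] * m[k] for k in m)
    upper = sum(max(phi[k] - mean, 0.0) * m[k] for k in m)
    return mean + kappa * upper


# Hypothetical two-outcome example: cost 0 or 2 with equal probability.
phi = {"a": 0.0, "b": 2.0}
m = {"a": 0.5, "b": 0.5}
risk_neutral = mean_semideviation(phi, m, kappa=0.0)  # plain expectation
risk_averse = mean_semideviation(phi, m, kappa=1.0)   # adds the upper semideviation
```

Setting `kappa` to zero recovers the expected value, so the parameter can be read as the degree of risk aversion at the current state.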

Example 3.2. Another important example is the Conditional Average Value at Risk (see, inter alia, Ogryczak
and Ruszczyński [28, Sec. 4], Pflug and Römisch [30, Sec. 2.2.3, 3.3.4], Rockafellar and Uryasev [34],
Ruszczyński and Shapiro [38, Example 4.3], [39, Example 6.2]), which has the following risk transition
counterpart:

$$\sigma(\varphi,x,m) = \inf_{\eta \in \mathbb{R}} \Big\{ \eta + \frac{1}{\alpha(x)} \big\langle (\varphi - \eta)_+, m \big\rangle \Big\}.$$

Here $\alpha : \mathcal{X} \to [\alpha_{\min}, \alpha_{\max}] \subset (0,1)$ is measurable. Again, the conditions (B1)-(B4) can be verified directly.
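For a finite measure the Average Value at Risk can be evaluated exactly. The sketch below assumes the standard Rockafellar-Uryasev form, the infimum over $\eta$ of $\eta$ plus the expected excess over $\eta$ divided by the level $\alpha$; on a finite space this piecewise linear convex objective attains its minimum at one of the values of the function. The name `average_value_at_risk` and the data are hypothetical.

```python
from typing import Dict, Hashable


def average_value_at_risk(
    phi: Dict[Hashable, float],
    m: Dict[Hashable, float],
    alpha: float,
) -> float:
    """Minimum over eta of eta + E_m[(phi - eta)_+] / alpha, for alpha in (0, 1].
    Only the values taken by phi need to be tried as candidates for eta."""
    def objective(eta: float) -> float:
        return eta + sum(max(phi[k] - eta, 0.0) * m[k] for k in m) / alpha

    return min(objective(eta) for eta in set(phi.values()))


# Hypothetical two-outcome example: cost 0 or 2 with equal probability.
phi = {"a": 0.0, "b": 2.0}
m = {"a": 0.5, "b": 0.5}
avar_075 = average_value_at_risk(phi, m, alpha=0.75)  # average of the worst 75%
avar_050 = average_value_at_risk(phi, m, alpha=0.5)   # average of the worst 50%
```

Smaller levels `alpha` focus on a thinner upper tail, so the value increases toward the largest outcome as `alpha` decreases.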



We shall use the property of law invariance of a risk transition mapping. For a function $\varphi \in \mathcal{V}$
and a probability measure $\mu \in \mathcal{M}$ we can define the distribution function
$F_\varphi^\mu : \mathbb{R} \to [0,1]$ as follows:

$$F_\varphi^\mu(\eta) = \mu\{ (u,y) \in \mathcal{U} \times \mathcal{X} : \varphi(u,y) \le \eta \}.$$

Definition 3.2. A risk transition mapping $\sigma : \mathcal{V} \times \mathcal{X} \times \mathcal{M} \to \mathbb{R}$
is law invariant, if for all $\varphi,\psi \in \mathcal{V}$ and all $\mu,\nu \in \mathcal{M}$ such that
$F_\varphi^\mu = F_\psi^\nu$, we have $\sigma(\varphi,x,\mu) = \sigma(\psi,x,\nu)$ for all $x \in \mathcal{X}$.

The concept of law invariance corresponds to a similar concept for coherent measures of risk, but here we
additionally need to take into account the variability of the probability measure. The risk transition mappings
of Examples 3.1 and 3.2 are law invariant. While we shall not directly use law invariance in our main
theoretical considerations, it greatly simplifies the analysis of specific problems, as illustrated in section 9.

Risk transition mappings allow for convenient formulation of risk-averse preferences for controlled Markov
processes, where the cost is evaluated by formula (4). Consider a controlled Markov process $\{x_t\}$ with some
Markov policy $\Pi = \{\pi_1,\pi_2,\dots\}$. For a fixed time $t$ and a measurable function
$g : \mathcal{X} \times \mathcal{U} \times \mathcal{X} \to \mathbb{R}$ the value of $Z_{t+1} = g(x_t,u_t,x_{t+1})$ is
a random variable. We assume that $g$ is $w$-bounded, that is,

$$|g(x,u,y)| \le C\,(w(x) + w(y)), \quad \forall\, x \in \mathcal{X},\ u \in U(x),\ y \in \mathcal{X},$$

for some constant $C > 0$ and for the weight function $w : \mathcal{X} \to [1,\infty)$, $w \in \mathcal{V}$. Then
$Z_{t+1}$ is an element of $\mathcal{Z}_{t+1}$. Let $\rho_t : \mathcal{Z}_{t+1} \to \mathcal{Z}_t$ be a conditional
risk measure satisfying (A1)-(A4). By definition, $\rho_t(g(x_t,u_t,x_{t+1}))$ is an element of $\mathcal{Z}_t$, that
is, it is an $\mathcal{F}_t$-measurable function on $(\Omega,\mathcal{F})$. In the definition below, we restrict it
to depend on the past only via the current state $x_t$. We write $g_x : \mathcal{U} \times \mathcal{X} \to \mathbb{R}$
for the function $g_x(u,y) = g(x,u,y)$, $\pi_x$ for the measure $\pi(\cdot \mid x)$, and $Q_x$ for the mapping
$u \mapsto Q(\cdot \mid x,u)$.

Definition 3.3. A one-step conditional risk measure $\rho_t : \mathcal{Z}_{t+1} \to \mathcal{Z}_t$ is a Markov risk
measure with respect to the controlled Markov process $\{x_t\}$, if there exists a risk transition mapping
$\sigma_t : \mathcal{V} \times \mathcal{X} \times \mathcal{M} \to \mathbb{R}$ such that for all $w$-bounded measurable
functions $g : \mathcal{X} \times \mathcal{U} \times \mathcal{X} \to \mathbb{R}$ and for all feasible decision rules
$\pi : \mathcal{X} \to \mathcal{P}(\mathcal{U})$ we have

$$\rho_t(g(x_t,u_t,x_{t+1})) = \sigma_t(g_{x_t}, x_t, \pi_{x_t} \circ Q_{x_t}), \quad \text{a.s.} \tag{10}$$

Observe that the right hand side of formula (10) is parametrized by $x_t$, and thus it defines a special
$\mathcal{F}_t$-measurable function of $\omega$, whose dependence on the past is carried only via the state $x_t$.

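Remark 3.1. If $c(x_t,u_t,x_{t+1}) = \bar{c}(x_t,x_{t+1})$, or if randomized policies are not allowed, then it is
sufficient to start from a probability measure $P_0$ on $\mathcal{X}$ and define
$\mathcal{V} = \mathcal{L}_p(\mathcal{X},\mathcal{B}(\mathcal{X}),P_0)$, $\mathcal{V}'$ as the set of measures on
$(\mathcal{X},\mathcal{B}(\mathcal{X}))$ having densities with respect to $P_0$ in
$\mathcal{L}_q(\mathcal{X},\mathcal{B}(\mathcal{X}),P_0)$, and
$\mathcal{M} = \{ m \in \mathcal{V}' : m(\mathcal{X}) = 1,\ m \ge 0 \}$,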
exactly as in | 



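Remark 3.2. If, additionally, the stage-wise costs have the form $c(x_t,u_t,x_{t+1},\xi_t)$, where $\xi_t$,
$t = 1,2,\dots$, are some random variables distributed on a Polish space $\Xi$ according to a measure which is
absolutely continuous with respect to some fixed $P_\xi$, but may depend on $x_t$ and $u_t$, then we need to consider
larger spaces of arguments of a risk transition mapping:

$$\mathcal{V} = \mathcal{L}_p(\mathcal{U} \times \mathcal{X} \times \Xi,\ \mathcal{B}(\mathcal{U} \times \mathcal{X} \times \Xi),\ P_0 \times P_\xi),$$

$$\mathcal{V}' = \mathcal{L}_q(\mathcal{U} \times \mathcal{X} \times \Xi,\ \mathcal{B}(\mathcal{U} \times \mathcal{X} \times \Xi),\ P_0 \times P_\xi),$$

$$\mathcal{M} = \Big\{ m \in \mathcal{V}' : \int m(u,x,\xi)\, P_0(du\, dx\, d\xi) = 1,\ m \ge 0 \Big\}.$$

All our considerations remain valid; only the notation becomes more complicated.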



4 Stochastic Multikernels 

In order to analyze Markov measures of risk, we need to introduce the concept of a multikernel. 

Definition 4.1. A multikernel is a measurable multifunction $\mathfrak{M}$ from $\mathcal{X}$ to the space
$\mathrm{rca}(\mathcal{X},\mathcal{B}(\mathcal{X}))$ of regular measures on $(\mathcal{X},\mathcal{B}(\mathcal{X}))$.
It is stochastic, if its values are sets of probability measures. It is substochastic, if $0 \le M(B \mid x) \le 1$
for all $M \in \mathfrak{M}(x)$, $B \in \mathcal{B}(\mathcal{X})$, and $x \in \mathcal{X}$. It is convex (closed), if
for all $x \in \mathcal{X}$ its value $\mathfrak{M}(x)$ is a convex (closed) set.

The concept of a multikernel is thus a multivalued generalization of the concept of a kernel. A measurable
selector of a stochastic multikernel $\mathfrak{M}$ is a stochastic kernel $M$ such that $M(x) \in \mathfrak{M}(x)$
for all $x \in \mathcal{X}$. We symbolically write $M \lessdot \mathfrak{M}$ to indicate that $M$ is a measurable
selector of $\mathfrak{M}$.

Recall that a composition $M_1 \circ M_2$ of (sub-) stochastic kernels $M_1$ and $M_2$ is given by the formula:

$$[M_1 \circ M_2](B \mid x) = \int_{\mathcal{X}} M_2(B \mid y)\, M_1(dy \mid x), \quad B \in \mathcal{B}(\mathcal{X}),\ x \in \mathcal{X}. \tag{11}$$

It is also a (sub-) stochastic kernel. Multikernels, in particular substochastic multikernels, can be composed
in a similar fashion.
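For finite state spaces a kernel is a matrix whose row $x$ lists the probabilities of the next state, and the composition of two kernels is their matrix product. A minimal sketch; the function name and the sample matrix are hypothetical:

```python
from typing import List

Matrix = List[List[float]]


def compose_kernels(M1: Matrix, M2: Matrix) -> Matrix:
    """Composition of two kernels on the same finite state space:
    entry (x, b) is the sum over y of M1[x][y] * M2[y][b]."""
    n = len(M1)
    return [
        [sum(M1[x][y] * M2[y][b] for y in range(n)) for b in range(n)]
        for x in range(n)
    ]


# Hypothetical chain: from state 0 stay or move to state 1 with equal
# probability; state 1 is absorbing.
M = [[0.5, 0.5],
     [0.0, 1.0]]
M_squared = compose_kernels(M, M)  # two-step transition probabilities
```

Composing a stochastic matrix with itself gives the two-step transition probabilities, and the rows of the result again sum to one.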

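Definition 4.2. If $\mathfrak{M}_i : \mathcal{X} \rightrightarrows \mathrm{rca}(\mathcal{X},\mathcal{B}(\mathcal{X}))$,
$i = 1,2$, are substochastic multikernels, then their composition $\mathfrak{M}_1 \circ \mathfrak{M}_2$ is defined as follows:

$$[\mathfrak{M}_1 \circ \mathfrak{M}_2](B \mid x) = \big\{ [M_1 \circ M_2](B \mid x) : M_i \lessdot \mathfrak{M}_i,\ i = 1,2 \big\}.$$

It follows from Definition 4.2 that a composition of (sub-) stochastic multikernels is a (sub-) stochastic
multikernel. We may compose a substochastic multikernel $\mathfrak{M}$ with itself several times, to obtain its "power":

$$(\mathfrak{M})^k = \underbrace{\mathfrak{M} \circ \mathfrak{M} \circ \cdots \circ \mathfrak{M}}_{k \text{ times}}.$$

The norm of a substochastic multikernel
$\mathfrak{M} : \mathcal{X} \rightrightarrows \mathrm{rca}(\mathcal{X},\mathcal{B}(\mathcal{X}))$ is defined as follows:

$$\|\mathfrak{M}\|_w = \sup_{M \lessdot \mathfrak{M}} \|M\|_w,$$

where the norm $\|M\|_w$ is given by (2).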

The concept of a multikernel and the composition operation arise in a natural way in the context of Markov
risk measures. If $\sigma(\cdot,\cdot,\cdot)$ is a risk transition mapping, then the function $\sigma(\cdot,x,m)$ is
lower semicontinuous for all $x \in \mathcal{X}$ and $m \in \mathcal{M}$ (see Ruszczyński and Shapiro
[38, Proposition 3.1]). Then it follows from [38, Theorem 2.2] that for every $x \in \mathcal{X}$ and
$m \in \mathcal{M}$ a closed convex set $\mathcal{A}(x,m) \subset \mathcal{M}$ exists, such that for all
$\varphi \in \mathcal{V}$ we have

$$\sigma(\varphi,x,m) = \max_{\mu \in \mathcal{A}(x,m)} \langle \varphi, \mu \rangle. \tag{12}$$

In fact, we also have

$$\mathcal{A}(x,m) = \partial_\varphi \sigma(0,x,m). \tag{13}$$

In many cases, the multifunction $\mathcal{A} : \mathcal{X} \times \mathcal{M} \rightrightarrows \mathcal{M}$ can be described analytically.



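Example 4.1. For the mean-semideviation model of Example 3.1, following the derivations of Ruszczyński
and Shapiro [38, Example 4.2], we have

$$\mathcal{A}(x,m) = \Big\{ \mu \in \mathcal{M} : \exists\, h \in \mathcal{L}_\infty(\mathcal{U} \times \mathcal{X},\mathcal{B},P_0),\ \frac{d\mu}{dm} = 1 + h - \langle h, m \rangle,\ \|h\|_\infty \le \kappa(x),\ h \ge 0 \Big\}. \tag{14}$$

Similar formulas can be derived for higher order measures.

Example 4.2. For the Conditional Average Value at Risk of Example 3.2, following the derivations of
Ruszczyński and Shapiro [38, Example 4.3], we obtain

$$\mathcal{A}(x,m) = \Big\{ \mu \in \mathcal{M} : \frac{d\mu}{dm} \le \frac{1}{\alpha(x)} \Big\}. \tag{15}$$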

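Consider the formula (10) and suppose that $g(x_t,u_t,x_{t+1}) = v(x_{t+1})$ for some measurable $w$-bounded
function $v : \mathcal{X} \to \mathbb{R}$. Using the representation (12) we can write it as follows:

$$\rho_t(v(x_{t+1})) = \max_{\mu \in \mathcal{A}(x_t,\, \pi_{x_t} \circ Q_{x_t})} \langle v, \mu \rangle, \quad \text{a.s.} \tag{16}$$

In the formula above, the last bilinear form is an integral over $\mathcal{U} \times \mathcal{X}$. The function
$v(\cdot)$ depends on $x$ only, and thus it is sufficient to consider the marginal measures

$$\bar{\mu}(B) = \mu(\mathcal{U} \times B), \quad B \in \mathcal{B}(\mathcal{X}). \tag{17}$$

Denote by $L$ the linear operator mapping each $\mu \in \mathcal{V}'$ to the corresponding marginal measure
$\bar{\mu}$ on $(\mathcal{X},\mathcal{B}(\mathcal{X}))$, as defined in (17). For every $x$ we can define the set of
probability measures:

$$\mathfrak{M}^\pi_x = \{ L\mu : \mu \in \mathcal{A}(x, \pi_x \circ Q_x) \}, \quad x \in \mathcal{X}. \tag{18}$$

The multifunction $\mathfrak{M}^\pi : \mathcal{X} \rightrightarrows \mathcal{P}(\mathcal{X})$, assigning to each
$x \in \mathcal{X}$ the set $\mathfrak{M}^\pi_x$, is a closed convex stochastic multikernel. We call it a risk
multikernel, associated with the risk transition mapping $\sigma(\cdot,\cdot,\cdot)$, the controlled kernel $Q$, and
the policy $\pi$. Its measurable selectors $M^\pi \lessdot \mathfrak{M}^\pi$ are transition kernels.
It follows that formula (16) can be rewritten as follows:

$$\rho_t(v(x_{t+1})) = \max_{M \in \mathfrak{M}^\pi_{x_t}} \int_{\mathcal{X}} v(y)\, M(dy). \tag{19}$$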

In the risk-neutral case we have

$$\rho_t(v(x_{t+1})) = \mathbb{E}[v(x_{t+1}) \mid x_t] = \int_{\mathcal{U}} \int_{\mathcal{X}} v(y)\, Q(dy \mid x_t,u)\, \pi(du \mid x_t) = \int_{\mathcal{X}} v(y)\, Q^\pi_{x_t}(dy),$$

with the transition kernel $Q^\pi$ associated with the policy $\pi$ given by $Q^\pi_x = L[\pi_x \circ Q_x]$. The
comparison of the last two displayed equations reveals that in the risk-neutral case we have

$$\mathfrak{M}^\pi_x = \{ Q^\pi_x \}, \quad x \in \mathcal{X}, \tag{20}$$

that is, the risk multikernel $\mathfrak{M}^\pi$ is single-valued, and its only selector is the kernel $Q^\pi$. In the
risk-averse case, the risk multikernel $\mathfrak{M}^\pi$ is a closed convex-valued multifunction, whose measurable
selectors are transition kernels. It is evident that properties of this multifunction are germane for our analysis.
We return to this issue in section 6, where we calculate some examples of transition multikernels.
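To make the role of a risk multikernel concrete, the sketch below computes a worst-case transition row for the Average Value at Risk on a finite state space. It assumes a single control at the current state (so no marginalization over controls is needed) and the standard dual description of the Average Value at Risk, in which admissible measures have density at most $1/\alpha$ with respect to the nominal transition probabilities. The function name and the data are hypothetical.

```python
from typing import Dict, Hashable


def worst_case_row(
    v: Dict[Hashable, float],
    q: Dict[Hashable, float],
    alpha: float,
) -> Dict[Hashable, float]:
    """Maximize sum_y v[y] * M[y] over probability vectors M with
    M[y] <= q[y] / alpha, by greedily loading mass on the largest values of v."""
    remaining = 1.0
    row: Dict[Hashable, float] = {}
    for y in sorted(q, key=lambda s: v[s], reverse=True):
        mass = min(q[y] / alpha, remaining)
        row[y] = mass
        remaining -= mass
    return row


# Hypothetical data: next-state values and nominal transition probabilities.
v = {1: 3.0, 2: 0.0}
q = {1: 0.5, 2: 0.5}
row = worst_case_row(v, q, alpha=0.75)
risk_value = sum(v[y] * row[y] for y in row)       # value under the worst-case row
nominal_value = sum(v[y] * q[y] for y in q)        # risk-neutral value
```

The worst-case row shifts probability toward the costly state, which is why the risk-averse evaluation exceeds the expected value; in the risk-neutral case the admissible set shrinks to the nominal row itself.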

Remark 4.1. If $m \in \mathcal{A}(x,m)$ for all $x \in \mathcal{X}$ and $m \in \mathcal{M}$, then it follows from
equation (18) that $Q^\pi$ is a measurable selector of $\mathfrak{M}^\pi$. Moreover, it follows from (12) that for any
function $\varphi \in \mathcal{V}$ we have

$$\rho_t(\varphi(u_t,x_{t+1})) \ge \int_{\mathcal{U} \times \mathcal{X}} \varphi(u,y)\, [\pi_{x_t} \circ Q_{x_t}](du \times dy) = \mathbb{E}[\varphi(u_t,x_{t+1}) \mid x_t].$$

It follows that the dynamic risk measure (4) is bounded from below by the expected value of the total cost.

The condition $m \in \mathcal{A}(x,m)$ is satisfied by the measures of risk in Examples 4.1 and 4.2.



Interestingly, uncertain transition matrices were used by Nilim and El Ghaoui in [24] to increase robustness
of control rules for Markov models. In our theory, controlled multikernels (a generalization of such
matrices) arise in a natural way in the analysis of risk-averse preferences.

Let us quickly recall continuity properties of the multifunctions involved in the construction of a Markov 
risk measure. 



Proposition 4.1. Suppose $\varphi \in \mathcal{V}$ and $x \in \mathcal{X}$. If the controlled kernel
$u \mapsto Q(\cdot \mid x,u)$ is continuous, and the multifunction $m \mapsto \partial_\varphi \sigma(0,x,m)$ is lower
semicontinuous, then the function $\lambda \mapsto \sigma(\varphi,x,\lambda \circ Q_x)$ is weakly* lower
semicontinuous on $\mathcal{P}(U(x))$.

Proof. For a continuous $Q$, the multifunction $\lambda \mapsto \mathcal{A}(x, \lambda \circ Q_x)$ inherits the
continuity properties of $\mathcal{A}$. The function $\mu \mapsto \langle \varphi, \mu \rangle$ is continuous on
$\mathcal{M}$ (in the weak* topology). The assertion follows now from the dual representation (12) by
[4, Theorem 1.4.16], whose proof remains valid in our setting as well. □



Some comments on the assumptions of Proposition 4.1 are in order. The continuity of the kernel $Q$
is a standard condition in the theory of risk-neutral Markov control processes (see, e.g., [16]). If the risk
transition mapping $\sigma(\cdot,\cdot,\cdot)$ is continuous, then its subdifferential (13) is upper semicontinuous.
However, in Proposition 4.1 we assume lower semicontinuity of the mapping
$m \mapsto \partial_\varphi \sigma(0,x,m)$, which is not trivial and should be verified for each case. For example,
the subdifferentials derived in Examples 4.1 and 4.2 are continuous with respect to $m$.

5 Finite Horizon Problem 

We consider the Markov model at times $1,2,\dots,T+1$ under general policies $\Pi = \{\pi_1,\pi_2,\dots,\pi_T\}$.
The cost at the last stage is given by a function $v_{T+1}(x_{T+1})$. Consider the problem

$$\min_{\Pi}\ J_T(\Pi,x_1), \tag{21}$$

with $J_T(\Pi,x_1)$ defined by formula (4), with Markov conditional risk measures $\rho_t$, $t = 1,\dots,T$, with risk
transition mappings $\sigma_t(\cdot,\cdot,\cdot)$:

$$J_T(\Pi,x_1) = \rho_1\Big( c(x_1,u_1,x_2) + \rho_2\big( c(x_2,u_2,x_3) + \cdots + \rho_T( c(x_T,u_T,x_{T+1}) + v_{T+1}(x_{T+1}) ) \cdots \big) \Big). \tag{22}$$



Theorem 5.1. Assume that the following conditions are satisfied:

(i) For every $x \in \mathcal{X}$ the transition kernel $Q(x,\cdot)$ is continuous;

(ii) The conditional risk measures $\rho_t$, $t = 1,\dots,T$, are Markov and such that for every $x \in \mathcal{X}$
the multifunction $\mathcal{A}_t(x,\cdot)$ is lower semicontinuous;

(iii) The function $c(\cdot,\cdot,\cdot)$ is $w$-bounded, measurable, and lower semicontinuous with respect to the
second argument;

(iv) For every $x \in \mathcal{X}$ the set $U(x)$ is compact;

(v) The function $v_{T+1}(\cdot)$ is $w$-bounded and measurable.

Then problem (21) has an optimal solution and its optimal value $v_1(x)$ is the solution of the following dynamic
programming equations:

$$v_t(x) = \min_{\lambda \in \mathcal{P}(U(x))} \sigma_t(c_x + v_{t+1}, x, \lambda \circ Q_x), \quad x \in \mathcal{X},\ t = T,\dots,1. \tag{23}$$

Moreover, an optimal Markov policy $\Pi = \{\pi_1,\dots,\pi_T\}$ exists and satisfies the equations:

$$\pi_t(x) \in \operatorname*{argmin}_{\lambda \in \mathcal{P}(U(x))} \sigma_t(c_x + v_{t+1}, x, \lambda \circ Q_x), \quad x \in \mathcal{X},\ t = T,\dots,1. \tag{24}$$

Conversely, every solution of equations (23)-(24) defines an optimal Markov policy $\Pi$.



Proof. Our proof is similar to the proof of Ruszczyński [36, Thm. 2], but with adjustments due to the use of
randomized strategies. We provide its short outline.

Owing to the monotonicity condition (B2) applied to $\rho_t$, $t = 1,\dots,T$, problem (21) can be written as
follows:

$$\min_{\pi_1,\dots,\pi_T} \rho_1\Big( c(x_1,u_1,x_2) + \cdots + \rho_T\big( c(x_T,u_T,x_{T+1}) + v_{T+1}(x_{T+1}) \big) \cdots \Big)$$
$$\qquad = \min_{\pi_1} \Big\{ \rho_1\Big( c(x_1,u_1,x_2) + \cdots + \min_{\pi_T} \rho_T\big( c(x_T,u_T,x_{T+1}) + v_{T+1}(x_{T+1}) \big) \cdots \Big) \Big\}.$$

Consider the innermost optimization problem. Owing to the Markov structure of the conditional risk measure
$\rho_T$, this problem can be rewritten as follows:

$$\min_{\lambda \in \mathcal{P}(U(x_T))} \sigma_T(c_{x_T} + v_{T+1}, x_T, \lambda \circ Q_{x_T}). \tag{25}$$

The problem becomes equivalent to (23) for $t = T$, and its solution is given by (24) for $t = T$. By
Proposition 4.1 the function $\lambda \mapsto \sigma_T(c_{x_T} + v_{T+1}, x_T, \lambda \circ Q_{x_T})$ is lower
semicontinuous. As the set of $\lambda \in \mathcal{P}(\mathcal{U})$ such that $\lambda(U(x_T)) = 1$ is weakly*
compact, the optimal randomized policy $\pi_T(x)$, which is the minimizer in (25), exists.

After that, the horizon $T+1$ is decreased to $T$, and the final cost becomes $v_T(x_T)$. Proceeding in this way
for $T, T-1, \dots, 1$ we obtain the assertion of the theorem. □



It follows from our proof that the functions $v_t(\cdot)$ calculated in (23) are the optimal values of tail
subproblems formulated for a fixed $x_t = x$ as follows:

$$v_t(x) = \min_{\pi_t,\dots,\pi_T} \rho_t\Big( c(x_t,u_t,x_{t+1}) + \rho_{t+1}\big( c(x_{t+1},u_{t+1},x_{t+2}) + \cdots + \rho_T( c(x_T,u_T,x_{T+1}) + v_{T+1}(x_{T+1}) ) \cdots \big) \Big).$$

We call them value functions, as in risk-neutral dynamic programming. It is clear that we may also have
non-stationary costs and transition kernels in this case. Also, the assumption that the process is transient is
not needed.

Equations (23)-(24) provide a computational recipe for solving finite horizon problems.
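The backward recursion can be sketched for a small finite model. The sketch below is an illustration under strong simplifying assumptions that are not part of the paper: the risk transition mapping is the mean-semideviation with a constant coefficient `kappa`, each state has at most two controls, and the minimization over randomized controls is replaced by a grid search over the mixing probability. All names and data are hypothetical.

```python
from typing import Dict, Hashable, List, Tuple

State = Hashable
Control = str


def mean_semideviation(phi, m, kappa):
    """Mean of phi under m plus kappa times the expected excess over the mean."""
    mean = sum(phi[k] * m[k] for k in m)
    return mean + kappa * sum(max(phi[k] - mean, 0.0) * m[k] for k in m)


def solve_finite_horizon(
    controls: Dict[State, List[Control]],
    Q: Dict[State, Dict[Control, Dict[State, float]]],
    cost: Dict[Tuple[State, Control, State], float],
    v_terminal: Dict[State, float],
    T: int,
    kappa: float,
    grid: int = 100,
):
    """Backward recursion over t = T, ..., 1; returns the value function at
    time 1 and the list of decision rules (index 0 is the rule for t = 1)."""
    v = dict(v_terminal)
    policy = []
    for _ in range(T):
        v_new, rule = {}, {}
        for x, us in controls.items():
            best_val, best_lam = float("inf"), None
            for i in range(grid + 1):
                p = i / grid  # probability of the first listed control
                weights = [1.0] if len(us) == 1 else [p, 1.0 - p]
                lam = dict(zip(us, weights))
                # Joint measure of (control, next state) and the cost-to-go on it.
                m = {(u, y): lam[u] * q for u in us for y, q in Q[x][u].items()}
                phi = {(u, y): cost.get((x, u, y), 0.0) + v[y] for (u, y) in m}
                val = mean_semideviation(phi, m, kappa)
                if val < best_val - 1e-12:
                    best_val, best_lam = val, lam
            v_new[x], rule[x] = best_val, best_lam
        v = v_new
        policy.insert(0, rule)
    return v, policy


# Hypothetical model: state 1 is absorbing with zero cost; at state 0 the
# control "safe" pays 2 and absorbs, "risky" pays nothing and absorbs w.p. 0.5.
controls = {0: ["safe", "risky"], 1: ["stay"]}
Q = {0: {"safe": {1: 1.0}, "risky": {1: 0.5, 0: 0.5}}, 1: {"stay": {1: 1.0}}}
cost = {(0, "safe", 1): 2.0}
v_T1, policy_T1 = solve_finite_horizon(controls, Q, cost, {0: 4.0, 1: 0.0}, T=1, kappa=0.5)
v_T2, policy_T2 = solve_finite_horizon(controls, Q, cost, {0: 4.0, 1: 0.0}, T=2, kappa=0.5)
```

In this particular example the grid search returns deterministic rules; the grid over mixtures is kept because, in the risk-averse setting, a strictly randomized rule may be the minimizer for other data.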



6 Evaluation of Stationary Markov Policies in Infinite Horizon Problems

Consider a policy $\Pi = \{\pi_1,\pi_2,\dots\}$ and define the cost until absorption as follows:

$$J_\infty(\Pi,x_1) = \lim_{T \to \infty} J_T(\Pi,x_1), \tag{26}$$

where each $J_T(\Pi,x_1)$ is defined by the formula

$$J_T(\Pi,x_1) = \rho_1\Big( c(x_1,u_1,x_2) + \rho_2\big( c(x_2,u_2,x_3) + \cdots + \rho_T( c(x_T,u_T,x_{T+1}) ) \cdots \big) \Big)$$
$$\qquad = \rho_{1,T+1}\big( 0, c(x_1,u_1,x_2), c(x_2,u_2,x_3), \dots, c(x_T,u_T,x_{T+1}) \big),$$

with Markov conditional risk measures $\rho_t$, $t = 1,\dots,T$, sharing the same risk transition mapping
$\sigma(\cdot,\cdot,\cdot)$. We assume all conditions of Theorem 5.1. We still have to index each conditional risk
measure by time, because by definition it acts from the space $\mathcal{Z}_{t+1}$ to the space $\mathcal{Z}_t$.

The first question to answer is when this cost is finite. This question is nontrivial, because even for
uniformly bounded costs $Z_t = c(x_{t-1},u_{t-1},x_t)$, $t = 2,3,\dots$, and for a transient finite-state Markov
chain, the limit in (26) may be infinite, as the following example demonstrates.
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Example 6.1. Consider a transient Markov chain with two states and with the following transition probabiU- 

ties: Qii = Q12 ~ \, Q22 — 1- Only one control is possible in each state, the cost of each transition from state 
1 is equal to 1, and the cost of the transition from 2 to 2 is 0. Clearly, the time until absorption is a geometric 
random variable with parameter j . Let xi = 1 . If the limit p6l ) is finite, then (skipping the dependence on 
17) we have 

y„(l) = iimjTil) = jlim Pi(l +^t-i(-«2)) = Pi(l +Mx2))- 

In the last equation we used the continuity of pi (•). Clearly, Jo^{2) — 0. 

Suppose that we are using the Average Value at Risk from Example 3.2 with < a < 5, to define pi (•)• 
Using standard identities for the Average Value at Risk (see, e.g., BTl Thm. 6.2]), we obtain 






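$$J_\infty(1) = 1 + \frac{1}{\alpha}\int_{1-\alpha}^{1} F^{-1}(\beta)\, d\beta, \tag{28}$$

where $F(\cdot)$ is the distribution function of $J_\infty(x_2)$. As all $\beta$-quantiles of $J_\infty(x_2)$ for $\beta \ge \frac12$ are equal to $J_\infty(1)$, the last equation yields

$$J_\infty(1) = 1 + J_\infty(1),$$

a contradiction. It follows that a composition of average values at risk has no finite limit, if $0 < \alpha \le \frac12$.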



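On the other hand, if $\frac12 < \alpha \le 1$, then

$$F^{-1}(\beta) = \begin{cases} J_\infty(2) = 0 & \text{if } 1-\alpha \le \beta < \frac12,\\ J_\infty(1) & \text{if } \frac12 \le \beta \le 1. \end{cases}$$

Formula (28) then yields

$$J_\infty(1) = 1 + \frac{1}{2\alpha}\, J_\infty(1).$$

This equation has a solution $J_\infty(1) = 2\alpha/(2\alpha - 1)$.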



If we use the mean-semideviation model of Example 3.1 we obtain

$$J_\infty(1) = \mathbb{E}\big[1 + J_\infty(x_2)\big] + \kappa\, \mathbb{E}\Big[\big(1 + J_\infty(x_2) - \mathbb{E}[1 + J_\infty(x_2)]\big)_+\Big].$$

Thus $J_\infty(1) = 4/(2-\kappa)$, which is finite for all $\kappa \in [0,1]$, that is, for all values of $\kappa$ for which the model defines a coherent measure of risk.
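The two regimes of the Average Value at Risk in Example 6.1 can be checked numerically. The following Python sketch iterates the recursion $J_T(1) = \rho_1(1 + J_{T-1}(x_2))$ from $J_0 = 0$; the function names and the closed-form tail average for this two-point distribution are illustrative assumptions, not part of the original text.

```python
def avar_step(j, alpha):
    """One step J -> AVaR_alpha(1 + J(x2)) for the chain of Example 6.1.

    The cost equals 1 + j with probability 1/2 (the chain stays in state 1)
    and 1 with probability 1/2 (absorption). For j >= 0 the outcome 1 + j is
    the worse one, and AVaR_alpha averages the worst alpha-fraction of mass.
    """
    mass_high = min(0.5, alpha)      # tail mass taken from the outcome 1 + j
    mass_low = alpha - mass_high     # remaining tail mass on the outcome 1
    return ((1.0 + j) * mass_high + 1.0 * mass_low) / alpha


def iterate(alpha, horizon):
    """Return J_T(1) for T = horizon, starting from J_0 = 0 (an assumed convention)."""
    j = 0.0
    for _ in range(horizon):
        j = avar_step(j, alpha)
    return j


diverging = iterate(0.5, 50)      # alpha = 1/2: the value grows with the horizon
converging = iterate(0.75, 200)   # alpha = 3/4: approaches 2*alpha/(2*alpha - 1)
```

With $\alpha = \frac12$ each step adds exactly one unit of cost, so no finite limit exists, while for $\alpha = \frac34$ the iterates settle at the closed-form value $2\alpha/(2\alpha-1) = 3$.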

It follows that deeper properties of the measures of risk and their interplay with the transition kernel need to be investigated to answer the question about finiteness of the dynamic measure of risk in this case. We propose a condition that generalizes the Pliska condition [31] to the risk-averse case.

Recall that with every risk transition mapping $\sigma(\cdot,\cdot,\cdot)$, every controlled kernel $Q$, and every decision rule $\pi$, a multikernel $\mathfrak{M}^\pi$ is associated, as defined in (18). Similarly to the expected value case, it is convenient to consider the effective state space $\widetilde{\mathcal{X}} = \mathcal{X} \setminus \{x_A\}$, and the effective substochastic multikernel $\widetilde{\mathfrak{M}}^\pi$ whose arguments are restricted to $\widetilde{\mathcal{X}}$ and whose values are convex sets of nonnegative measures on $\widetilde{\mathcal{X}}$, so that $\widetilde{M}(B\,|\,x) = M(B\,|\,x)$, for all Borel sets $B \subset \widetilde{\mathcal{X}}$, and all $M \in \mathfrak{M}^\pi$.

Definition 6.1. We call the Markov model with a risk transition mapping $\sigma(\cdot,\cdot,\cdot)$ and with a stationary Markov policy $\{\pi, \pi, \dots\}$ risk-transient if a weight function $w : \widetilde{\mathcal{X}} \to [1,\infty)$, $w \in \mathcal{V}$, and a constant $K$ exist such that

$$\sum_{j=1}^{\infty} \big\| (\widetilde{\mathfrak{M}}^\pi)^j \big\|_w \le K. \tag{29}$$

If the estimate (29) is uniform for all Markov policies, the model is called uniformly risk-transient.

In the special case of a risk-neutral model, Definition 6.1 reduces to the Pliska condition [31], owing to the equation (20).



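Example 6.2. Consider the simple transient chain of Example 6.1 with the Average Value at Risk from Examples 3.2 and 4.2, where $0 < \alpha \le 1$. From (15) we obtain

$$\mathcal{A}(x,m) = \Big\{ (\mu_1,\mu_2) : 0 \le \mu_j \le \frac{m_j}{\alpha},\ j = 1,2;\ \mu_1 + \mu_2 = 1 \Big\}.$$

As only one control is possible, formula (18) simplifies to

$$\mathfrak{M}_i = \Big\{ (M_1, M_2) : 0 \le M_j \le \frac{Q_{ij}}{\alpha},\ j = 1,2;\ M_1 + M_2 = 1 \Big\}, \quad i = 1,2.$$

The effective state space is just $\widetilde{\mathcal{X}} = \{1\}$, and we conclude that the effective multikernel has the form

$$\widetilde{\mathfrak{M}}_1 = \Big[\, 0,\ \min\Big(1, \frac{1}{2\alpha}\Big) \Big].$$

For $0 < \alpha \le \frac12$ we can select $M = 1 \in \widetilde{\mathfrak{M}}_1$ to show that $1 \in (\widetilde{\mathfrak{M}}_1)^j$ for all $j$, and thus condition (29) is not satisfied. On the other hand, if $\frac12 < \alpha \le 1$, then for every $M \in \widetilde{\mathfrak{M}}_1$ we have $0 \le M \le \frac{1}{2\alpha} < 1$, and condition (29) is satisfied.

Consider now the mean-semideviation model of Examples 3.1 and 4.1. From (14) we obtain

$$\mathcal{A}(x,m) = \big\{ (\mu_1,\mu_2) : \mu_j = m_j\big(1 + h_j - (h_1 m_1 + h_2 m_2)\big),\ 0 \le h_j \le \kappa,\ j = 1,2 \big\},$$

$$\mathfrak{M}_i = \big\{ (M_1,M_2) : M_j = Q_{ij}\big(1 + h_j - (h_1 Q_{i1} + h_2 Q_{i2})\big),\ 0 \le h_j \le \kappa,\ j = 1,2 \big\}, \quad i = 1,2.$$

Calculating the lowest and the largest possible values of $M_1$ we conclude that

$$\widetilde{\mathfrak{M}}_1 = \Big[\, \tfrac12\Big(1 - \frac{\kappa}{2}\Big),\ \tfrac12\Big(1 + \frac{\kappa}{2}\Big) \Big].$$

For every $\kappa \in [0,1]$, Definition 6.1 is satisfied.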
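For this one-state effective model, condition (29) reduces to a geometric series. The sketch below assumes the weight function $w \equiv 1$ and assumes that the norm of the set $(\widetilde{\mathfrak{M}}_1)^j$ is the $j$-th power of the largest element of $\widetilde{\mathfrak{M}}_1$; the helper names are hypothetical.

```python
import math


def transience_constant(m_sup):
    """Sum over j >= 1 of m_sup**j, where m_sup is the largest element of the
    effective multikernel on the single state; infinite if m_sup >= 1."""
    return m_sup / (1.0 - m_sup) if m_sup < 1.0 else math.inf


def avar_sup(alpha):
    """Largest element of the AVaR multikernel of Example 6.2."""
    return min(1.0, 1.0 / (2.0 * alpha))


def msd_sup(kappa):
    """Largest element of the mean-semideviation multikernel of Example 6.2."""
    return 0.5 * (1.0 + kappa / 2.0)


k_avar_half = transience_constant(avar_sup(0.5))     # not risk-transient
k_avar_075 = transience_constant(avar_sup(0.75))     # finite
k_msd_one = transience_constant(msd_sup(1.0))        # finite
```

Under these assumptions the finite constants agree with Example 6.1: since the cost of every step is 1, the limit $J_\infty(1)$ equals one plus the constant computed here.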



We start our analysis from an estimate of the risk in a finite horizon model of a final cost given by a certain function $v(x_T)$, where $T$ is the horizon, and $v : \mathcal{X} \to \mathbb{R}$ is a measurable function with $\|v\|_w < \infty$ for the weight function $w : \widetilde{\mathcal{X}} \to [1,\infty)$, $w \in \mathcal{V}$, and with $v(x_A) = 0$. In the lemma below, we consider $x_1 \in \widetilde{\mathcal{X}}$ as a parameter of the problem, and thus $\rho_{1,T}(0,\dots,0,v(x_T))$ is a function of $x_1$.

Lemma 6.1. Suppose a stationary policy $\Pi = \{\pi, \pi, \dots\}$ is applied to a controlled Markov model with a Markov risk transition mapping $\sigma(\cdot,\cdot,\cdot)$. If the model is risk-transient, then there exists a function $v_1 : \widetilde{\mathcal{X}} \to \mathbb{R}$, $\|v_1\|_w < \infty$, such that for all $x_1 \in \widetilde{\mathcal{X}}$, and all $T > 1$,

$$\rho_{1,T}(0,\dots,0,v(x_T)) \le v_1(x_1), \tag{30}$$

and

$$\|v_1\|_w \le \big\|(\widetilde{\mathfrak{M}}^\pi)^{T-1}\big\|_w \, \|v\|_w, \tag{31}$$



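where $\widetilde{\mathfrak{M}}^\pi$ is the substochastic risk multikernel implied by $\Pi$ and $\sigma$.

Proof. By construction,

$$\rho_{1,T}(0,\dots,0,v(x_T)) = \rho_1\big(\rho_2(\cdots\rho_{T-1}(v(x_T))\cdots)\big).$$

Applying (19), we obtain

$$\rho_{T-1}(v(x_T)) = \max_{m_{T-1} \in \widetilde{\mathfrak{M}}^\pi_{x_{T-1}}} \int_{\widetilde{\mathcal{X}}} v(y)\, m_{T-1}(dy). \tag{32}$$

It is a function of $x_{T-1}$, which we denote as $v_{T-1}(x_{T-1})$. Since $\|v\|_w < \infty$ and $w \in \mathcal{V}$, then $v \in \mathcal{V}$. As the sets $\widetilde{\mathfrak{M}}^\pi_x$ are weakly* compact, the maximum in (32) is achieved. Moreover,

$$\|v_{T-1}\|_w \le \big\|\widetilde{\mathfrak{M}}^\pi\big\|_w \, \|v\|_w < \infty.$$

One step earlier, in a similar way we obtain

$$\begin{aligned}
\rho_{T-2}\big(\rho_{T-1}(v(x_T))\big) &= \max_{m_{T-2} \in \widetilde{\mathfrak{M}}^\pi_{x_{T-2}}} \int_{\widetilde{\mathcal{X}}} v_{T-1}(y)\, m_{T-2}(dy)\\
&= \max_{m_{T-2} \in \widetilde{\mathfrak{M}}^\pi_{x_{T-2}}} \int_{\widetilde{\mathcal{X}}} \Big[ \max_{m_{T-1} \in \widetilde{\mathfrak{M}}^\pi_{y}} \int_{\widetilde{\mathcal{X}}} v(z)\, m_{T-1}(dz) \Big]\, m_{T-2}(dy).
\end{aligned}$$

The maximizers $m_{T-1} \in \widetilde{\mathfrak{M}}^\pi_y$ under the integral can be chosen in such a way that they form a measurable selector $M_{T-1} \lessdot \widetilde{\mathfrak{M}}^\pi$ (see, e.g., [35, Thm. 14.37]). On the other hand, no measurable selector can do better than the pointwise maximizers. We can, therefore, interchange the operations of maximization and integration and conclude that

$$\rho_{T-2}\big(\rho_{T-1}(v(x_T))\big) = \max_{m_{T-2} \in \widetilde{\mathfrak{M}}^\pi_{x_{T-2}}} \ \max_{M_{T-1} \lessdot \widetilde{\mathfrak{M}}^\pi} \int_{\widetilde{\mathcal{X}}} \int_{\widetilde{\mathcal{X}}} v(z)\, M_{T-1}(dz\,|\,y)\, m_{T-2}(dy).$$

Similarly, the outer maximizer may be represented as a value of a certain measurable selector of $\widetilde{\mathfrak{M}}^\pi$ at $x_{T-2}$. Denoting the value of the above expression by $v_{T-2}(x_{T-2})$, we obtain

$$v_{T-2}(x) = \max_{M_{T-2} \lessdot \widetilde{\mathfrak{M}}^\pi} \ \max_{M_{T-1} \lessdot \widetilde{\mathfrak{M}}^\pi} \int_{\widetilde{\mathcal{X}}} \int_{\widetilde{\mathcal{X}}} v(z)\, M_{T-1}(dz\,|\,y)\, M_{T-2}(dy\,|\,x).$$

Changing the order of integration we observe that the double integral above can be represented as an integral with respect to a composition of the kernels $M_{T-2}$ and $M_{T-1}$ (cf. formula (17)). We obtain

$$v_{T-2}(x) = \max_{M_{T-2} \lessdot \widetilde{\mathfrak{M}}^\pi} \ \max_{M_{T-1} \lessdot \widetilde{\mathfrak{M}}^\pi} \int_{\widetilde{\mathcal{X}}} v(z)\, \big[M_{T-2} \circ M_{T-1}\big](dz\,|\,x) \le \max_{M \lessdot (\widetilde{\mathfrak{M}}^\pi)^2} \int_{\widetilde{\mathcal{X}}} v(z)\, M(dz\,|\,x) = \bar{v}_{T-2}(x).$$

The last inequality follows from the fact that $M_{T-2} \circ M_{T-1} \lessdot (\widetilde{\mathfrak{M}}^\pi)^2$. Therefore, $v_{T-2} \le \bar{v}_{T-2}$, where

$$\|\bar{v}_{T-2}\|_w \le \big\|(\widetilde{\mathfrak{M}}^\pi)^2\big\|_w \, \|v\|_w < \infty.$$

Continuing in this way, we conclude that

$$\rho_1\big(\rho_2(\cdots\rho_{T-1}(v(x_T))\cdots)\big) \le \max_{M \lessdot (\widetilde{\mathfrak{M}}^\pi)^{T-1}} \int_{\widetilde{\mathcal{X}}} v(z)\, M(dz\,|\,x_1).$$

Denoting the right-hand side by $v_1(x_1)$, we obtain the estimates (30)-(31). □

We can now provide sufficient conditions for the finiteness of the limit (26).

Theorem 6.1. Suppose a stationary policy $\Pi = \{\pi, \pi, \dots\}$ is applied to a controlled Markov model with a Markov risk transition mapping $\sigma(\cdot,\cdot,\cdot)$. If the model is risk-transient for the policy $\Pi$ and the cost function $c(\cdot,\cdot,\cdot)$ is $w$-bounded, then the limit (26) is finite and $\|J_\infty(\Pi,\cdot)\|_w < \infty$. If the model is uniformly risk-transient, then $\|J_\infty(\Pi,\cdot)\|_w$ is uniformly bounded.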



Proof. By Lemma 3.1 each conditional risk measure $\rho_{1,T}(\cdot)$ is convex and positively homogeneous, and thus subadditive. For any $1 < T_1 < T_2$ we obtain the following estimate of (27):

$$\begin{aligned}
J_{T_2-1}(\Pi,x_1) &= \rho_{1,T_2}(0,Z_2,\dots,Z_{T_2})\\
&\le \rho_{1,T_2}(0,Z_2,\dots,Z_{T_1},0,\dots,0) + \sum_{j=T_1}^{T_2-1} \rho_{1,T_2}(0,\dots,0,Z_{j+1},0,\dots,0)\\
&= \rho_{1,T_1}(0,Z_2,\dots,Z_{T_1}) + \sum_{j=T_1}^{T_2-1} \rho_{1,j+1}(0,\dots,0,Z_{j+1}).
\end{aligned} \tag{33}$$

By assumption, $|Z_{j+1}| \le C\big(\bar{w}(x_j) + \bar{w}(x_{j+1})\big)$, where $\bar{w}(x) = w(x)$ if $x \in \widetilde{\mathcal{X}}$, and $\bar{w}(x_A) = 0$. Owing to the monotonicity and positive homogeneity of the conditional risk mappings,

$$\begin{aligned}
\rho_{1,j+1}(0,\dots,0,Z_{j+1}) &\le C \rho_1\Big(\rho_2\big(\cdots\rho_{j-1}\big(\rho_j(\bar{w}(x_j) + \bar{w}(x_{j+1}))\big)\cdots\big)\Big)\\
&= C \rho_1\Big(\rho_2\big(\cdots\rho_{j-1}\big(\bar{w}(x_j) + \rho_j(\bar{w}(x_{j+1}))\big)\cdots\big)\Big)\\
&\le C \rho_1\big(\rho_2(\cdots\rho_{j-1}(\bar{w}(x_j))\cdots)\big) + C \rho_1\big(\rho_2(\cdots\rho_j(\bar{w}(x_{j+1}))\cdots)\big).
\end{aligned}$$

In the middle equation we used the fact that $\bar{w}(x_j)$ is $\mathcal{F}_j$-measurable, and in the last inequality we used the subadditivity of the risk measures. Since $\|\bar{w}\|_w = 1$, Lemma 6.1 implies that

$$\rho_1\big(\rho_2(\cdots\rho_j(\bar{w}(x_{j+1}))\cdots)\big) \le v_j(x_1)$$

with

$$\|v_j\|_w \le \big\|(\widetilde{\mathfrak{M}}^\pi)^j\big\|_w. \tag{34}$$

Substitution to (33) yields the estimate

$$\rho_{1,T_2}(0,Z_2,\dots,Z_{T_2}) \le \rho_{1,T_1}(0,Z_2,\dots,Z_{T_1}) + 2C \sum_{j=T_1-1}^{T_2-1} v_j(x_1). \tag{35}$$

Consider now the sequence of costs $Z_1,\dots,Z_{T_1}, -Z_{T_1+1},\dots,-Z_{T_2}$, in which we flip the sign of the costs $Z_{t+1} = c(x_t,u_t,x_{t+1})$ for $t \ge T_1$. As $|Z_{t+1}|$ are bounded by $C\big(\bar{w}(x_t) + \bar{w}(x_{t+1})\big)$, the estimate (35) applies to the new sequence. We obtain

$$\rho_{1,T_2}(0,Z_2,\dots,Z_{T_1},-Z_{T_1+1},\dots,-Z_{T_2}) \le \rho_{1,T_1}(0,Z_2,\dots,Z_{T_1}) + 2C \sum_{j=T_1-1}^{T_2-1} v_j(x_1). \tag{36}$$

By convexity and positive homogeneity of $\rho_{1,T_2}(\cdot)$,

$$2\rho_{1,T_1}(0,Z_2,\dots,Z_{T_1}) \le \rho_{1,T_2}(0,Z_2,\dots,Z_{T_1},Z_{T_1+1},\dots,Z_{T_2}) + \rho_{1,T_2}(0,Z_2,\dots,Z_{T_1},-Z_{T_1+1},\dots,-Z_{T_2}).$$

Substituting the estimate (36), we deduce that

$$\rho_{1,T_2}(0,Z_2,\dots,Z_{T_2}) \ge \rho_{1,T_1}(0,Z_2,\dots,Z_{T_1}) - 2C \sum_{j=T_1-1}^{T_2-1} v_j(x_1).$$

This combined with (35) yields

$$\big| J_{T_2-1}(\Pi,x_1) - J_{T_1-1}(\Pi,x_1) \big| \le 2C \sum_{j=T_1-1}^{T_2-1} v_j(x_1).$$

In view of (34), we conclude that

$$\big\| J_{T_2-1}(\Pi,\cdot) - J_{T_1-1}(\Pi,\cdot) \big\|_w \le 2C \sum_{j=T_1-1}^{T_2-1} \big\|(\widetilde{\mathfrak{M}}^\pi)^j\big\|_w. \tag{37}$$

By Definition 6.1 the right-hand side of the last displayed inequality converges to 0, when $T_1, T_2 \to \infty$, $T_1 < T_2$. Hence, the sequence of functions $J_T(\Pi,\cdot)$, $T = 1,2,\dots$, is convergent to some limit $J_\infty(\Pi,\cdot)$. Moreover, $\|J_\infty(\Pi,\cdot)\|_w < \infty$. If the model is uniformly risk-transient, then the estimate (37) is the same for all Markov policies $\Pi$, and thus $\|J_\infty(\Pi,\cdot)\|_w$ is uniformly bounded. □



Remark 6.1. It is clear from the proof of Theorem 6.1 that

$$J_\infty(\Pi,x_1) = \lim_{T\to\infty} \rho_{1,T}\big(0,Z_2,\dots,Z_T + f(x_T)\big), \tag{38}$$

for any measurable function $f : \mathcal{X} \to \mathbb{R}$, with $\|f\|_w < \infty$, because $c(x_{T-1},u_{T-1},x_T) + f(x_T)$ is still $w$-bounded.

This analysis allows us to derive dynamic programming equations for the infinite horizon problem, in the 
case of a fixed Markov policy. 

Theorem 6.2. Suppose a controlled Markov model with a Markov risk transition mapping $\sigma(\cdot,\cdot,\cdot)$ is risk-transient for the stationary Markov policy $\Pi = \{\pi, \pi, \dots\}$, with some weight function $w(\cdot)$. Then a measurable function $v : \mathcal{X} \to \mathbb{R}$, with $\|v\|_w < \infty$, satisfies the equations

$$v(x) = \sigma\big(c_x + v, x, \pi(x) \circ Q_x\big), \quad x \in \widetilde{\mathcal{X}}, \tag{39}$$

$$v(x_A) = 0, \tag{40}$$

if and only if $v(x) = J_\infty(\Pi,x)$ for all $x \in \mathcal{X}$.



Proof. Denote $Z_t = c(x_{t-1},u_{t-1},x_t)$. Suppose a measurable function $v(\cdot)$ satisfies the dynamic programming equations (39)-(40). Since $\|v\|_w < \infty$ and $w \in \mathcal{V}$, then also $v \in \mathcal{V}$. By assumption, $c(\cdot,\cdot,\cdot)$ is $w$-bounded, and thus $c_x(\cdot,\cdot) \in \mathcal{V}$. Consequently, the right-hand side of (39) is well-defined. By iteration of (39), we obtain for all $x_1 \in \widetilde{\mathcal{X}}$ the following equation:

$$v(x_1) = \rho_1\Big(c(x_1,u_1,x_2) + \rho_2\big(c(x_2,u_2,x_3) + \cdots + \rho_T\big(c(x_T,u_T,x_{T+1}) + v(x_{T+1})\big)\cdots\big)\Big).$$

Using monotonicity and subadditivity of the conditional risk measures we deduce that

$$\rho_{1,T+1}\big(0,Z_2,\dots,Z_{T+1} + v(x_{T+1})\big) \le \rho_{1,T+1}(0,Z_2,\dots,Z_{T+1}) + \rho_{1,T+1}\big(0,0,\dots,v(x_{T+1})\big). \tag{41}$$

By Lemma 6.1,

$$v(x_1) = \rho_{1,T+1}\big(0,Z_2,\dots,Z_{T+1} + v(x_{T+1})\big) \le \rho_{1,T+1}(0,Z_2,\dots,Z_{T+1}) + d_T(x_1), \tag{42}$$

with

$$\|d_T\|_w \le \big\|(\widetilde{\mathfrak{M}}^\pi)^{T}\big\|_w \, \|v\|_w. \tag{43}$$

By convexity of $\rho_{1,T+1}(\cdot)$,

$$\begin{aligned}
2\rho_{1,T+1}(0,Z_2,\dots,Z_{T+1}) &\le \rho_{1,T+1}\big(0,Z_2,\dots,Z_{T+1} + v(x_{T+1})\big) + \rho_{1,T+1}\big(0,Z_2,\dots,Z_{T+1} - v(x_{T+1})\big)\\
&= v(x_1) + \rho_{1,T+1}\big(0,Z_2,\dots,Z_{T+1} - v(x_{T+1})\big).
\end{aligned} \tag{44}$$

Similar to (41)-(42),

$$\rho_{1,T+1}\big(0,Z_2,\dots,Z_{T+1} - v(x_{T+1})\big) \le \rho_{1,T+1}(0,Z_2,\dots,Z_{T+1}) + d_T(x_1).$$

Substituting into (44) we obtain

$$v(x_1) \ge \rho_{1,T+1}(0,Z_2,\dots,Z_{T+1}) - d_T(x_1).$$

Combining this estimate with (42) and using (43) we conclude that

$$\big\| v(\cdot) - J_T(\Pi,\cdot) \big\|_w \le \|d_T\|_w \to 0, \quad \text{as } T \to \infty.$$

Thus $v(\cdot) = J_\infty(\Pi,\cdot)$, as postulated.

To prove the converse implication we can use the fact that all conditional risk measures $\rho_t(\cdot)$ share the same risk transition mapping to rewrite (27) as follows:

$$J_T(\Pi,x_1) = \rho_1\big(c(x_1,u_1,x_2) + J_{T-1}(\Pi,x_2)\big).$$

The function $\rho_1(\cdot)$, as a finite-valued coherent measure of risk on a Banach lattice, is continuous (see [38, Prop. 3.1]). Since $\|J_T(\Pi,\cdot) - J_\infty(\Pi,\cdot)\|_w \to 0$, as $T \to \infty$, then the sequence $\{J_T(\Pi,\cdot)\}$ is also convergent in the space $\mathcal{V}$. Therefore,

$$\lim_{T\to\infty} J_T(\Pi,x_1) = \rho_1\Big(c(x_1,u_1,x_2) + \lim_{T\to\infty} J_{T-1}(\Pi,x_2)\Big).$$

This is identical with equation (39) with $v(\cdot) = J_\infty(\Pi,\cdot)$. Equation (40) is obvious. □
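Theorem 6.2 suggests a simple way to evaluate a fixed stationary policy on a finite model: iterate equation (39) with the boundary condition (40) until the values settle. The following Python sketch does this for the mean-semideviation risk transition mapping; it assumes a deterministic decision rule already folded into the kernel and cost arrays, and all names (`evaluate_policy`, `kernel`, `cost`) are hypothetical.

```python
def mean_semideviation(values, probs, kappa):
    """sigma = E[phi] + kappa * E[(phi - E[phi])_+] for a discrete distribution."""
    mean = sum(p * z for p, z in zip(probs, values))
    upper = sum(p * max(z - mean, 0.0) for p, z in zip(probs, values))
    return mean + kappa * upper


def evaluate_policy(kernel, cost, kappa, absorbing, sweeps=500):
    """Iterate v <- sigma(c_x + v, x, Q_x) with v(x_A) = 0, cf. (39)-(40).

    kernel[x][y] is the transition probability and cost[x][y] the transition
    cost under the fixed decision rule (assumed deterministic in this sketch).
    """
    n = len(kernel)
    v = [0.0] * n
    for _ in range(sweeps):
        v = [
            0.0 if x == absorbing else
            mean_semideviation([cost[x][y] + v[y] for y in range(n)],
                               kernel[x], kappa)
            for x in range(n)
        ]
    return v


# The two-state chain of Example 6.1 (index 0 is state 1, index 1 is absorbing).
kernel = [[0.5, 0.5], [0.0, 1.0]]
cost = [[1.0, 1.0], [0.0, 0.0]]
v_risk_averse = evaluate_policy(kernel, cost, kappa=1.0, absorbing=1)
v_risk_neutral = evaluate_policy(kernel, cost, kappa=0.0, absorbing=1)
```

On this chain the iteration reproduces the closed form $J_\infty(1) = 4/(2-\kappa)$ of Example 6.1. A fixed number of sweeps is used only for brevity; a stopping test on the change in `v` would be the more careful choice.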



7 Dynamic Programming Equations for Infinite Horizon Problems 

We shall now focus on the optimal value function

$$J^*(x) = \inf_{\Pi \in \Pi^{RM}} J_\infty(\Pi,x), \quad x \in \mathcal{X}, \tag{45}$$

where $\Pi^{RM}$ is the set of all stationary Markov policies.

Theorem 7.1. Assume that the following conditions are satisfied:

(i) For every $x \in \mathcal{X}$ the transition kernel $Q(x,\cdot)$ is continuous;

(ii) The conditional risk measures $\rho_t$, $t = 1,\dots,T$, are Markov and such that for every $x \in \mathcal{X}$ the multifunction $\mathcal{A}(x,\cdot)$ is lower semicontinuous;

(iii) The function $c(\cdot,\cdot,\cdot)$ is $w$-bounded and lower semicontinuous with respect to the second argument;

(iv) For every $x \in \mathcal{X}$ the set $U(x)$ is compact;

(v) The model is uniformly risk-transient.

Then a function $v : \mathcal{X} \to \mathbb{R}$, with $\|v\|_w < \infty$, satisfies the equations

$$v(x) = \inf_{\lambda \in \mathcal{P}(U(x))} \sigma\big(c_x + v, x, \lambda \circ Q_x\big), \quad x \in \widetilde{\mathcal{X}}, \tag{46}$$

$$v(x_A) = 0, \tag{47}$$

if and only if $v(x) = J^*(x)$ for all $x \in \mathcal{X}$. Moreover, the minimizer $\pi^*(x)$, $x \in \widetilde{\mathcal{X}}$, on the right-hand side of (46) exists and defines an optimal randomized Markov policy $\Pi^* = \{\pi^*, \pi^*, \dots\}$.



Proof. Suppose $J^*(\cdot)$ is given by (45). The set of policies of the form $\{\lambda, \pi, \pi, \dots\}$ is larger than $\Pi^{RM}$, and thus

$$J^*(x_1) \ge \inf_{\lambda \in \mathcal{P}(U(x_1))} \ \inf_{\Pi \in \Pi^{RM}} \rho_1\big(c(x_1,u_1,x_2) + J_\infty(\Pi,x_2)\big).$$

By the monotonicity of $\rho_1(\cdot)$ we can move the infimum operator inside:

$$\begin{aligned}
J^*(x_1) &\ge \inf_{\lambda \in \mathcal{P}(U(x_1))} \rho_1\Big(c(x_1,u_1,x_2) + \inf_{\Pi \in \Pi^{RM}} J_\infty(\Pi,x_2)\Big)\\
&= \inf_{\lambda \in \mathcal{P}(U(x_1))} \rho_1\big(c(x_1,u_1,x_2) + J^*(x_2)\big).
\end{aligned}$$

As the model is uniformly risk-transient, $\|J^*\|_w < \infty$, and the right-hand side is well-defined. Thus $J^*(\cdot)$ satisfies the inequality

$$J^*(x) \ge \inf_{\lambda \in \mathcal{P}(U(x))} \sigma\big(c_x + J^*, x, \lambda \circ Q_x\big), \quad x \in \widetilde{\mathcal{X}}. \tag{48}$$

The mapping $\lambda \mapsto \sigma(c_x + J^*, x, \lambda \circ Q_x)$ is continuous for all $x$, and the set of $\lambda \in \mathcal{P}(\mathcal{U})$ such that $\lambda(U(x)) = 1$ is weakly* compact. Therefore, there exists a minimizer $\pi^*(x)$ on the right-hand side of (48). Hence,

$$J^*(x) \ge \sigma\big(c_x + J^*, x, \pi^*(x) \circ Q_x\big), \quad x \in \widetilde{\mathcal{X}}.$$

Iterating this inequality we conclude that $J^*(x_1)$ is bounded below by

$$J^*(x_1) \ge \rho_{1,T}\big(0,Z_2,\dots,Z_T + J^*(x_T)\big), \tag{49}$$

with the sequence of controls and states resulting from the stationary Markov policy $\Pi^* = \{\pi^*, \pi^*, \dots\}$. Owing to Remark 6.1 we can pass to the limit on the right-hand side and obtain the inequality:

$$J^*(x_1) \ge J_\infty(\Pi^*,x_1), \quad x_1 \in \widetilde{\mathcal{X}}.$$

It follows that $\Pi^*$ is the optimal stationary Markov policy, and thus $J^*(\cdot) = J_\infty(\Pi^*,\cdot)$. By Theorem 6.2 relation (48) is an equation, which proves (46)-(47).

To prove the converse implication, suppose $v(\cdot)$ satisfies (46)-(47) and $\|v\|_w < \infty$. By the continuity of the mapping $\lambda \mapsto \sigma(c_x + v, x, \lambda \circ Q_x)$ and weak* compactness of the set of $\lambda \in \mathcal{P}(\mathcal{U})$ such that $\lambda(U(x)) = 1$, there exists a randomized control $\hat{\pi}(\cdot)$, which is a minimizer on the right-hand side of (46). We obtain the equation

$$v(x) = \sigma\big(c_x + v, x, \hat{\pi}(x) \circ Q_x\big), \quad x \in \widetilde{\mathcal{X}}.$$

By Theorem 6.2,

$$v(x) = J_\infty(\hat{\Pi},x) \ge J^*(x), \quad x \in \mathcal{X}, \tag{50}$$

where $\hat{\Pi} = \{\hat{\pi}, \hat{\pi}, \dots\}$. On the other hand, it follows from (46) that for the optimal policy $\Pi^* = \{\pi^*, \pi^*, \dots\}$ we have

$$v(x) \le \sigma\big(c_x + v, x, \pi^*(x) \circ Q_x\big), \quad x \in \widetilde{\mathcal{X}}. \tag{51}$$

The risk transition mapping $\sigma$ is nondecreasing with respect to the first argument. Therefore, iterating inequality (51) we obtain an inequality corresponding to (49):

$$v(x_1) \le \rho_{1,T}\big(0,Z_2,\dots,Z_T + v(x_T)\big).$$

Passing to the limit with $T \to \infty$ and applying Remark 6.1 we conclude that

$$v(x) \le J_\infty(\Pi^*,x) = J^*(x), \quad x \in \mathcal{X}.$$

The last estimate together with (50) implies that $v(\cdot) = J^*(\cdot)$ and that both stationary policies $\Pi^*$ and $\hat{\Pi}$ are optimal. □

We can now address the case of general non-stationary policies. For a policy $\Lambda = \{\lambda_1, \lambda_2, \dots\}$ we define

$$J_\infty(\Lambda,x_1) = \liminf_{T\to\infty} J_T(\Lambda,x_1)$$

and

$$J(x_1) = \inf_{\Lambda} J_\infty(\Lambda,x_1).$$

Theorem 7.2. Assume that the conditions of Theorem 7.1 are satisfied, together with the following assumption: there exists a constant $C$ such that $J_\infty(\Lambda,x) \ge -C w(x)$ for all $x \in \widetilde{\mathcal{X}}$ and for all policies $\Lambda$. Then a function $v : \mathcal{X} \to \mathbb{R}$, with $\|v\|_w < \infty$, satisfies the equations (46)-(47) if and only if $v(x) = J(x)$ for all $x \in \mathcal{X}$. Moreover, the minimizer $\pi^*(x)$, $x \in \widetilde{\mathcal{X}}$, on the right-hand side of (46) exists and defines an optimal policy $\Pi^* = \{\pi^*, \pi^*, \dots\}$.

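Proof. As for stationary Markov policies $\Pi$ we have $\|J_\infty(\Pi,\cdot)\|_w < \infty$, in view of the additional assumption we have $\|J\|_w < \infty$. Denote $\Lambda' = \{\lambda_2, \lambda_3, \dots\}$. Due to the monotonicity and continuity of $\rho_1(\cdot)$, we have the chain of relations

$$\begin{aligned}
J(x_1) &= \inf_{\lambda_1,\lambda_2,\dots} \liminf_{T\to\infty} \rho_1\big(c(x_1,u_1,x_2) + J_{T-1}(\Lambda',x_2)\big)\\
&\ge \inf_{\lambda_1,\lambda_2,\dots} \liminf_{T\to\infty} \rho_1\Big(c(x_1,u_1,x_2) + \inf_{\tau \ge T-1} J_\tau(\Lambda',x_2)\Big)\\
&= \inf_{\lambda_1,\lambda_2,\dots} \lim_{T\to\infty} \rho_1\Big(c(x_1,u_1,x_2) + \inf_{\tau \ge T-1} J_\tau(\Lambda',x_2)\Big)\\
&= \inf_{\lambda_1,\lambda_2,\dots} \rho_1\Big(c(x_1,u_1,x_2) + \liminf_{T\to\infty} J_{T-1}(\Lambda',x_2)\Big) = \inf_{\lambda_1,\lambda_2,\dots} \rho_1\big(c(x_1,u_1,x_2) + J_\infty(\Lambda',x_2)\big).
\end{aligned}$$

Owing to the monotonicity of $\rho_1(\cdot)$, we can move the minimization with respect to $\Lambda'$ inside the argument, to obtain

$$J(x_1) \ge \inf_{\lambda_1} \rho_1\Big(c(x_1,u_1,x_2) + \inf_{\Lambda'} J_\infty(\Lambda',x_2)\Big) = \inf_{\lambda_1} \rho_1\big(c(x_1,u_1,x_2) + J(x_2)\big).$$

Thus $J(\cdot)$ satisfies an inequality analogous to (48):

$$J(x) \ge \inf_{\lambda \in \mathcal{P}(U(x))} \sigma\big(c_x + J, x, \lambda \circ Q_x\big), \quad x \in \widetilde{\mathcal{X}}. \tag{52}$$

We can now repeat the argument from the proof of Theorem 7.1. Denoting by $\hat{\lambda}$ the minimizer above, iterating inequality (52), and passing to the limit we conclude that

$$J(x) \ge J_\infty(\hat{\Lambda},x), \quad x \in \widetilde{\mathcal{X}},$$

where $\hat{\Lambda} = \{\hat{\lambda}, \hat{\lambda}, \dots\}$ is a stationary Markov policy. Therefore, optimization with respect to stationary Markov policies is sufficient, and the result follows from Theorem 7.1. □

Our additional technical assumption that $J_\infty(\Lambda,x) \ge -C w(x)$ is obviously true for nonnegative costs $c(\cdot,\cdot,\cdot)$. More generally, it is true in the case when the cost function is $w$-bounded, the model is transient, and $\mu \in \mathcal{A}(x,\mu)$, for all $x \in \mathcal{X}$ and $\mu \in \mathcal{M}$. Indeed, by virtue of Remark 4.1 the dynamic risk measure is bounded from below by the expected value of the cost, which is finite in this case.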

8 Randomized versus Deterministic Control 

Observe that the mapping $\lambda \mapsto \sigma(c_x + v, x, \lambda \circ Q_x)$, which plays the key role in the dynamic programming equation (46), is nonlinear, in general, as opposed to the expected value model, where

$$\sigma\big(c_x + v, x, \lambda \circ Q_x\big) = \int_{U(x)} \int_{\mathcal{X}} \big(c(x,u,y) + v(y)\big)\, Q(dy\,|\,x,u)\, \lambda(du\,|\,x).$$

In the expected value case, it is sufficient to consider only the extreme points of the set $\mathcal{P}(U(x))$, which are the measures assigning unit mass to one of the controls $u \in U(x)$:

$$\inf_{\lambda \in \mathcal{P}(U(x))} \int_{U(x)} \int_{\mathcal{X}} \big(c(x,u,y) + v(y)\big)\, Q(dy\,|\,x,u)\, \lambda(du\,|\,x) = \inf_{u \in U(x)} \int_{\mathcal{X}} \big(c(x,u,y) + v(y)\big)\, Q(dy\,|\,x,u).$$

In the risk-averse case this simplification is not justified and a randomized policy may be strictly better than any deterministic policy. Of course, we may always restrict the set of possible decision rules to deterministic rules, and solve the corresponding version of the dynamic equation (46):

$$v(x) = \min_{\lambda \in \{\delta_u \,:\, u \in U(x)\}} \sigma\big(c_x + v, x, \lambda \circ Q_x\big), \quad x \in \widetilde{\mathcal{X}}, \tag{53}$$

where $\{\delta_u : u \in U(x)\}$ denotes the set of Dirac measures supported at $U(x)$. For a fixed $x \in \widetilde{\mathcal{X}}$ and a Dirac measure $\lambda = \delta_u$, the function $c_x + v = c(x,u) + v(y)$ is only a function of the next state $y \in \mathcal{X}$, and the measure $\lambda \circ Q_x$ is the measure $Q(\cdot\,|\,x,u)$ on the state space $\mathcal{X}$. We can, therefore, rewrite (53) in a simpler form

$$v(x) = \min_{u \in U(x)} \Big\{ c(x,u) + \sigma\big(v, x, Q(\cdot\,|\,x,u)\big) \Big\}, \quad x \in \widetilde{\mathcal{X}}, \tag{54}$$

where (with a slight abuse of notation) $\sigma : \mathcal{L}_p(\mathcal{X},\mathcal{B}(\mathcal{X}),P_0) \times \mathcal{X} \times \mathcal{L}_q(\mathcal{X},\mathcal{B}(\mathcal{X}),P_0) \to \mathbb{R}$, and $\sigma(\cdot,\cdot,\cdot)$ is a coherent measure of risk with respect to its first argument. In equation (54) we also used the translation property of coherent measures of risk. This is almost exactly the form of the dynamic programming equation which we derived in [36] for discounted problems, but with the discount factor $\alpha = 1$.

A question arises whether it is possible to identify cases in which deterministic policies are sufficient. It 
turns out that we can prove this for a class of measures of risk which are called optimized certainty equivalents 
0: 

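$$\sigma(\varphi,x,\mu) = \inf_{\xi \in \mathbb{R}} \int_{U(x) \times \mathcal{X}} \Big\{ \xi + G\big(\varphi(u,y) - \xi;\, x\big) \Big\}\, \mu(du \times dy). \tag{55}$$

In formula (55), the function $G : \mathbb{R} \to \mathbb{R}$ is nondecreasing and convex, with $G(0) = 0$ and $1 \in \partial G(0)$. We assume that $|G(z)| \le c(1 + |z|^p)$ for all $z \in \mathbb{R}$, with some $c > 0$ and $p \ge 1$, and we define $\mathcal{V}$ using the same $p$, so that the integral above is finite for $\varphi \in \mathcal{V}$.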



Lemma 8.1. If the risk transition mapping has the form (55) then the dynamic programming equations (46) have a solution in deterministic decision rules.

Proof. Interchanging the integration and the infimum in the definition of an optimized certainty equivalent, we obtain a lower bound

$$\begin{aligned}
\sigma(\varphi,x,\lambda \circ Q_x) &= \inf_{\xi \in \mathbb{R}} \int_{U(x)} \int_{\mathcal{X}} \Big\{ \xi + G\big(\varphi(u,y) - \xi;\, x\big) \Big\}\, Q(dy\,|\,x,u)\, \lambda(du\,|\,x)\\
&\ge \int_{U(x)} \inf_{\xi \in \mathbb{R}} \int_{\mathcal{X}} \Big\{ \xi + G\big(\varphi(u,y) - \xi;\, x\big) \Big\}\, Q(dy\,|\,x,u)\, \lambda(du\,|\,x).
\end{aligned}$$

The above inequality becomes an equation for every Dirac measure $\lambda$. On the right-hand side of (46) we have

$$\inf_{\lambda \in \mathcal{P}(U(x))} \sigma\big(c_x + v, x, \lambda \circ Q_x\big) \ge \inf_{\lambda \in \mathcal{P}(U(x))} \int_{U(x)} \inf_{\xi \in \mathbb{R}} \int_{\mathcal{X}} \Big\{ \xi + G\big(c(x,u,y) + v(y) - \xi;\, x\big) \Big\}\, Q(dy\,|\,x,u)\, \lambda(du\,|\,x).$$

As the right-hand side achieves its minimum over $\lambda \in \mathcal{P}(U(x))$ at a Dirac measure concentrated at an extreme point of $U(x)$, and both sides coincide in this case, the minimum of the left-hand side is also achieved at such a measure. Consequently, for risk transition mappings of form (55) deterministic Markov policies are optimal. □
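The key inequality in the proof of Lemma 8.1 can be observed on a small static example. The sketch below uses the Average Value at Risk, which is known to be an optimized certainty equivalent with $G(z) = \max(z,0)/\alpha$; the two cost distributions, the mixing weight, and the helper names are made-up illustrations, not data from the paper.

```python
from fractions import Fraction


def avar(dist, alpha):
    """AVaR_alpha of a discrete cost given as {value: probability}:
    the average over the worst (largest) alpha-fraction of the mass."""
    remaining, total = alpha, Fraction(0)
    for z in sorted(dist, reverse=True):
        take = min(dist[z], remaining)
        total += take * z
        remaining -= take
        if remaining == 0:
            break
    return total / alpha


def mix(d1, d2, lam):
    """Distribution of the cost when control 1 is used w.p. lam, control 2 otherwise."""
    out = {}
    for d, weight in ((d1, lam), (d2, 1 - lam)):
        for z, p in d.items():
            out[z] = out.get(z, Fraction(0)) + weight * p
    return out


alpha = Fraction(1, 5)
cost_u1 = {4: Fraction(1)}                               # hypothetical control 1
cost_u2 = {10: Fraction(1, 10), 0: Fraction(9, 10)}      # hypothetical control 2
lam = Fraction(1, 2)

risk_of_mixture = avar(mix(cost_u1, cost_u2, lam), alpha)
mixture_of_risks = lam * avar(cost_u1, alpha) + (1 - lam) * avar(cost_u2, alpha)
best_deterministic = min(avar(cost_u1, alpha), avar(cost_u2, alpha))
```

For this risk measure the risk of the randomized choice is never below the average of the risks of the pure choices, and hence never below the best pure choice, which is the pattern the lemma exploits.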

9 Illustrative Examples 

We illustrate our models and results on two simple examples. 

9.1 Asset Selling 

Let us at first consider the classical example of asset selling originating from Karlin [19]. Offers $Y_t$ arriving in time periods $t = 1,2,\dots$ are independent integer-valued integrable random variables. At each time we may accept the highest offer received so far, or we may wait, in which case a waiting cost $c$ is incurred. Denoting the random stopping time by $T$ we see that the total "cost" equals $Z = cT - \max_{0 \le j \le T} Y_j$. The problem is an example of an optimal stopping problem, a structure of considerable theoretical and practical relevance (see, e.g., Çinlar [9], Dynkin and Yushkevich [11, 12], and Puterman [32]).

Formally, we introduce the state space $\mathcal{X} = \{x_A\} \cup \{0,1,2,\dots\}$, where $x_A$ is the absorbing state reached after the transaction, and the other states represent the highest offer received so far. The control space is $\mathcal{U} = \{0,1\}$, with 0 representing "wait" and 1 representing "sell." The state evolves according to the equation

$$x_{t+1} = \begin{cases} \max(x_t, Y_{t+1}) & \text{if } u_t = 0,\\ x_A & \text{if } u_t = 1. \end{cases}$$

This formula defines the transition kernel $Q$. The cost is

$$c(x_t,u_t) = \begin{cases} c & \text{if } u_t = 0,\\ -x_t & \text{if } u_t = 1. \end{cases}$$

The expected value version of this problem has a known solution: accept the first offer greater than or equal to the solution $\hat{x}$ of the equation

$$c = \mathbb{E}\big[(Y - \hat{x})_+\big]. \tag{56}$$

We shall solve the risk-averse version of the problem in deterministic policies. For a stationary risk transition mapping $\sigma$, equation (54) has the form:

$$v(x) = \min\big\{ -x,\ c + \sigma(v,x,Q_x) \big\}, \quad x \in \widetilde{\mathcal{X}}. \tag{57}$$

Suppose $\sigma$ is law invariant (Definition 3.2). As the distribution of $v$ with respect to the measure $Q_x$ is the same as the distribution of $v(\max(x,Y))$ under the measure $P_Y$ of $Y$, we obtain

$$\sigma(v,x,Q_x) = \sigma\big(v(\max(x,Y)), x, P_Y\big).$$

Suppose our attitude to risk does not depend on the current state, that is, $\sigma$ does not depend on its second argument. Using (12), we may rewrite the last equation as follows:

$$\sigma(v,x,Q_x) = \max_{\mu \in \mathcal{A}} \mathbb{E}_\mu\big[v(\max(x,Y))\big].$$

The convex closed set of probability measures $\mathcal{A}$ is fixed. Equation (57) takes on the form

$$v(x) = \min\Big\{ -x,\ c + \max_{\mu \in \mathcal{A}} \mathbb{E}_\mu\big[v(\max(x,Y))\big] \Big\}, \quad x \in \widetilde{\mathcal{X}}. \tag{58}$$

Observe that $v(x) \le -x$ and thus $v(\max(x,Y)) \le -\max(x,Y)$. The last displayed inequality implies that

$$v(x) \le \min\Big\{ -x,\ c + \max_{\mu \in \mathcal{A}} \mathbb{E}_\mu\big[-\max(x,Y)\big] \Big\} = \min\Big\{ -x,\ c - \min_{\mu \in \mathcal{A}} \mathbb{E}_\mu\big[\max(x,Y)\big] \Big\}, \quad x \in \widetilde{\mathcal{X}}.$$

If the offer at level $x$ is accepted, then $v(x) = -x$. We obtain the inequality:

$$\min_{\mu \in \mathcal{A}} \mathbb{E}_\mu\big[(Y - x)_+\big] \le c.$$

This suggests the solution: accept any offer that is greater than or equal to the solution $x^*$ of the equation

$$\min_{\mu \in \mathcal{A}} \mathbb{E}_\mu\big[(Y - x^*)_+\big] = c; \tag{59}$$

if $x < x^*$, then wait. The corresponding value function equals:

$$v^*(x) = -\max(x, x^*).$$

Equation (58) can be verified by direct substitution.

Observe that the solution (59) of the risk-averse problem is closely related to the solution (56) of the expected value problem. The only difference is that we have to account for the least favorable distribution of the offers. If $P_Y \in \mathcal{A}$, then the critical level $x^* \le \hat{x}$.
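The threshold rules (56) and (59) can be compared on a toy instance. The sketch below assumes that the set $\mathcal{A}$ is the Average Value at Risk set (densities with respect to $P_Y$ bounded by $1/\alpha$), so that the least favorable expectation is the average over the lowest $\alpha$-fraction of the mass; the offer distribution, the waiting cost, and all function names are hypothetical. Because offers are integer-valued, the code looks for the smallest level at which the left-hand side of (59) drops to $c$ or below, rather than for an exact root.

```python
from fractions import Fraction


def worst_case_mean(values, probs, alpha):
    """min of E_mu[Z] over measures mu with density d(mu)/dP <= 1/alpha:
    the average of Z over its lowest alpha-fraction of probability mass."""
    remaining, total = alpha, Fraction(0)
    for z, p in sorted(zip(values, probs)):
        take = min(p, remaining)
        total += take * z
        remaining -= take
        if remaining == 0:
            break
    return total / alpha


def threshold(offers, probs, c, alpha):
    """Smallest level x with min_mu E_mu[(Y - x)_+] <= c."""
    for x in range(min(offers), max(offers) + 1):
        gains = [max(y - x, 0) for y in offers]
        if worst_case_mean(gains, probs, alpha) <= c:
            return x
    return max(offers)


offers = list(range(11))                 # Y uniform on {0, ..., 10} (made-up data)
probs = [Fraction(1, 11)] * 11
c = Fraction(1)                          # waiting cost (made-up)

x_hat = threshold(offers, probs, c, Fraction(1))       # alpha = 1: expected value rule
x_star = threshold(offers, probs, c, Fraction(1, 2))   # risk-averse rule
```

In this instance the risk-averse seller accepts noticeably lower offers than the risk-neutral one, in line with the relation $x^* \le \hat{x}$.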

9.2 Organ Transplant 

We illustrate our results on a risk-averse version of a simplified organ transplant problem discussed in Alagoz et al. [1]. We consider the discrete-time absorbing Markov chain depicted in Figure 1. State S, which is the initial state, represents a patient in need of an organ transplant. State L represents life after a successful transplant. State D (absorbing state) represents death. Two control values are possible in state S: W (for "Wait"), in which case transition to state D or back to state S may occur, and T (for "Transplant"), which results in a transition to states L or D. The probability of death is lower for W than for T, but successful transplant may result in a longer life, as explained below. In the other two states only one (formal) control value is possible: "Continue". The rewards collected at each time step are months of life. In state S a reward equal to 1 is collected, if the control is W; otherwise, the immediate reward is 0. In state L the reward r(L) is collected, representing the sure equivalent of the random length of life after transplant. In state D the reward is 0.

Generally, in a cost minimization problem, the value of a dynamic measure of risk $\varrho$ is the "fair" sure charge one would be willing to incur, instead of a random sequence of costs. In our case, which will be a maximization problem, we shall work with the negatives of the months of life as our "costs." The value of the measure of risk, therefore, can be interpreted as the negative of a sure life length which we consider to be equivalent to the random life duration faced by the patient.

[Figure 1: The organ transplant model. States S, L, and D; rewards $r(S,W) = 1$, $r(S,T) = 0$, $r(D) = 0$; transition probabilities $q_{L,D} = 1$, $q_{D,D} = 1$.]

Let us start from describing the way the deterministic equivalent length of life r(L) at state L is calculated. The state L is in fact an aggregation of n states in a survival model representing months of life after transplant, as depicted in Figure 2.

[Figure 2: The survival model.]

In state $i = 1,\dots,n$, the patient dies with probability $p_i$ and survives with probability $1 - p_i$. The probability $p_n = 1$. The reward collected at each state $i = 1,\dots,n$ is equal to 1. In order to follow the notation of our paper, we define the cost $c(\cdot) = -r(\cdot)$. For illustration, we apply the mean-semideviation model of Example 3.1 with $\kappa = 1$.

The risk transition mapping has the form:

$$\sigma(\varphi,i,\nu) = \underbrace{\mathbb{E}_\nu[\varphi]}_{\text{expected value}} + \kappa\, \underbrace{\mathbb{E}_\nu\big[(\varphi - \mathbb{E}_\nu[\varphi])_+\big]}_{\text{semideviation}}. \tag{60}$$

Owing to the monotonicity property B2, $\sigma(\varphi,i,\nu) \le 0$, whenever $\varphi(\cdot) \le 0$.

In (60), the measure $\nu$ is the transition kernel at the current state $i$, and the function $\varphi(\cdot)$ is the cost incurred at the current state and control plus the value function at the next state. At each state $i = 1,\dots,n-1$ two transitions are possible: to D with probability $p_i$ and $\varphi = -1$, and to $i+1$ with probability $1 - p_i$ and $\varphi = -1 + v_{i+1}(i+1)$. At state $i = n$ the transition to D occurs with probability 1, and $\varphi = -1$. Therefore, $v_n(n) = -1$.

The survival problem is a finite horizon problem, and thus we apply equation (23). As there is no control to choose, the minimization operation is eliminated. The equation has the form:

$$v_i(i) = \sigma(\varphi,i,Q_i), \quad i = 1,\dots,n-1,$$

with $\varphi$ and $Q_i$ as explained above. By induction, $v_i(i) \le 0$, for $i = n-1, n-2, \dots, 1$.

Let us calculate the mean and semideviation components of (60) at states $i = 1,\dots,n-1$:

$$\mathbb{E}_{Q_i}[\varphi] = -1 + (1-p_i)\, v_{i+1}(i+1),$$

$$\begin{aligned}
\mathbb{E}_{Q_i}\big[(\varphi - \mathbb{E}_{Q_i}[\varphi])_+\big] &= \mathbb{E}_{Q_i}\big[\big(\varphi + 1 - (1-p_i)\, v_{i+1}(i+1)\big)_+\big]\\
&= p_i\big(-1 + 1 - (1-p_i)\, v_{i+1}(i+1)\big)_+ + (1-p_i)\big(-1 + v_{i+1}(i+1) + 1 - (1-p_i)\, v_{i+1}(i+1)\big)_+\\
&= p_i\big(-(1-p_i)\, v_{i+1}(i+1)\big)_+ + (1-p_i)\big(p_i\, v_{i+1}(i+1)\big)_+\\
&= -p_i(1-p_i)\, v_{i+1}(i+1).
\end{aligned}$$

In the last equation we used the fact that $v_{i+1}(i+1) \le 0$. For $i = 1,\dots,n-1$, the dynamic programming equations (23) take on the form:

$$v_i(i) = \underbrace{-1 + (1-p_i)\, v_{i+1}(i+1)}_{\text{expected value}} - \kappa\, \underbrace{p_i(1-p_i)\, v_{i+1}(i+1)}_{\text{semideviation}}, \quad i = n-1, n-2, \dots, 1.$$

The value $v_1(1)$ is the negative of the risk-adjusted length of life with new organ. For $\kappa = 0$ the above formulas give the negative of the expected length of life with new organ.
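The backward recursion for the survival model is short enough to write out. The following Python sketch implements it for an arbitrary list of death probabilities; the three-state data used for the check are made up, and the function name is hypothetical.

```python
def risk_adjusted_value(p, kappa):
    """Backward recursion of the survival model.

    p[i] is the death probability in state i + 1 (the last entry must be 1).
    Starts from v_n = -1 and applies
        v_i = -1 + (1 - p_i) v_{i+1} - kappa * p_i (1 - p_i) v_{i+1}
    for i = n-1, ..., 1. Returns v_1, the negative of the risk-adjusted
    length of life.
    """
    if p[-1] != 1.0:
        raise ValueError("the last death probability must equal 1")
    v = -1.0
    for p_i in reversed(p[:-1]):
        v = -1.0 + (1.0 - p_i) * v - kappa * p_i * (1.0 - p_i) * v
    return v


toy_p = [0.5, 0.5, 1.0]                          # made-up three-state example
expected_life = -risk_adjusted_value(toy_p, 0.0)
risk_adjusted_life = -risk_adjusted_value(toy_p, 1.0)
```

With $\kappa = 0$ the result coincides with the expected number of months lived (here $1 + 0.5 + 0.25$), and a positive $\kappa$ shortens the sure equivalent.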

In our calculations we used the transition data provided in Table 1. They have been chosen for purely illustrative purposes and do not correspond to any real medical situation.

Control    S          L          D
W          0.99882    -          0.00118
T          -          0.90782    0.09218

Table 1: Transition probabilities from state S.

For the survival model, we used the distribution function, $F(x)$, of lifetime of the American population from Jasiulewicz [18]. It is a mixture of Weibull, lognormal, and Gompertz distributions:

$$F(x) = w_1\Big(1 - \exp\Big(-\Big(\frac{x}{\delta}\Big)^{\beta}\Big)\Big) + w_2\, \Phi\Big(\frac{\log x - m}{\sigma}\Big) + w_3\Big(1 - \exp\Big(-\frac{b}{\alpha}\big(e^{\alpha x} - 1\big)\Big)\Big), \quad x \ge 0.$$

The values of the parameters and weights, provided by Jasiulewicz [18], are given in Table 2.

Distribution   Parameters                            Weights
Weibull        $\delta = 0.297$, $\beta = 0.225$     $w_1 = 0.0170$
Lognormal      $m = 3.11$, $\sigma = 0.218$          $w_2 = 0.0092$
Gompertz       $b = 0.0000812$, $\alpha = 0.0844$    $w_3 = 0.9737$

Table 2: Values of parameters for $F(x)$.

Then, we calculated the probability of dying at age $k$ (in months) as follows:

$$p_k = \frac{F(k/12 + 1/24) - F(k/12 - 1/24)}{1 - F(k/12 - 1/24)}.$$

The maximum lifetime of the patient was taken to be 1200 months, and we assumed that the patient after transplant has survival probabilities starting from $k = 300$. Therefore, $n = 900$ in the survival model used for calculating r(L).
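The monthly death probabilities can be computed directly from the mixture distribution. The sketch below is one possible reading of the formulas: it assumes that the Weibull component has scale 0.297 and shape 0.225, that ages are measured in years inside $F$, that survival state $i$ corresponds to age $k = 299 + i$ months, and that the last probability is set to 1 by hand. The constant and function names are hypothetical.

```python
import math

# Parameter values as listed in Table 2 (assignment to scale/shape is assumed).
W1, W2, W3 = 0.0170, 0.0092, 0.9737
DELTA, BETA = 0.297, 0.225          # Weibull
M, SIGMA = 3.11, 0.218              # lognormal
B, ALPHA = 0.0000812, 0.0844        # Gompertz


def std_normal_cdf(z):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def lifetime_cdf(x):
    """Mixture distribution function F(x) for age x in years."""
    if x <= 0.0:
        return 0.0
    weibull = 1.0 - math.exp(-((x / DELTA) ** BETA))
    lognormal = std_normal_cdf((math.log(x) - M) / SIGMA)
    gompertz = 1.0 - math.exp(-(B / ALPHA) * (math.exp(ALPHA * x) - 1.0))
    return W1 * weibull + W2 * lognormal + W3 * gompertz


def death_probability(k):
    """Probability of dying at age k months, given survival to k - 1/2 months."""
    lo = lifetime_cdf(k / 12.0 - 1.0 / 24.0)
    hi = lifetime_cdf(k / 12.0 + 1.0 / 24.0)
    return (hi - lo) / (1.0 - lo)


# Death probabilities for the n = 900 survival states (ages 300, ..., 1199 months).
p = [death_probability(k) for k in range(300, 1200)]
p[-1] = 1.0    # the patient cannot live past the maximum lifetime
```

The resulting list can be fed into the backward recursion of the survival model to obtain a sure equivalent r(L) for a chosen $\kappa$; whether it matches the values reported in the text depends on the assumptions listed above.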

Let $\lambda = (\lambda_W, \lambda_T)$ be the randomized policy in the state S and let $\Lambda = \{\lambda \in \mathbb{R}^2 : \lambda_W + \lambda_T = 1,\ \lambda \ge 0\}$. The dynamic programming equation (46) at S takes on the form

$$\begin{aligned}
v(S) = \min_{\lambda \in \Lambda} \Big\{\ &\underbrace{\lambda_W\big[q_{S,S}(W)(v(S) - 1) + q_{S,D}(W)(v(D) - 1)\big] + \lambda_T\big[q_{S,L}(T)\, v(L) + q_{S,D}(T)\, v(D)\big]}_{\text{expected value } \mu}\\
&+ \kappa\, \underbrace{\Big( \lambda_W\big[q_{S,S}(W)(v(S) - 1 - \mu)_+ + q_{S,D}(W)(v(D) - 1 - \mu)_+\big] + \lambda_T\big[q_{S,L}(T)(v(L) - \mu)_+ + q_{S,D}(T)(v(D) - \mu)_+\big] \Big)}_{\text{semideviation}}\ \Big\}.
\end{aligned}$$

In the semideviation parts, we wrote $\mu$ for the expectation of the value function in the next state, which is given by the first underbraced expression, and which is also dependent on $\lambda$. Of course, the above expression can be simplified, by using the fact that $v(L) \le v(S) \le v(D) = 0$, but we prefer to leave it in the above form to illustrate the way it has been developed.

We compared two optimal control models for this problem. The first one was the expected value model ($\kappa = 0$), which corresponds to the expected reward r(L) = 610.46 in the survival model. Standard dynamic programming equations were solved, and the optimal decision in state S turned out to be W.

The second model was the risk-averse model using the mean-semideviation risk transition mapping with $\kappa = 1$. This changed the reward at state L to 515.35. We considered two versions of this model. In the first version, we restricted the feasible policies to be deterministic. In this case, the optimal action in state S was T. In the second version, we allowed randomized policies, as in our general model. Then the optimal policy in state S was W with probability $\lambda_W = 0.9873$ and T with probability $\lambda_T = 0.0127$.

How can we interpret these results? The optimal randomized policy results in a random waiting time before transplanting the organ. This is due to the fact that immediate transplant entails a significant probability of death, and a less risky policy is to "dilute" this probability in a long waiting time. This cannot be derived from an expected value model, no matter what the data, because deterministic policies are optimal in such a model: either transplant immediately or never.
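The comparison of deterministic and randomized policies at state S can be reproduced with a few lines of code. The sketch below evaluates each stationary randomized rule by solving the fixed-point equation for $v(S)$ and then searches over $\lambda_W$; the bisection solver and the grid search are choices made here for illustration and need not match the procedure used by the authors. The transition probabilities are those of Table 1 and the values of r(L) are the ones quoted in the text; all function names are hypothetical.

```python
def sigma(outcomes, kappa):
    """Mean-semideviation E[Z] + kappa * E[(Z - E[Z])_+] of a discrete cost,
    given as a list of (probability, value) pairs."""
    mean = sum(p * z for p, z in outcomes)
    semi = sum(p * max(z - mean, 0.0) for p, z in outcomes)
    return mean + kappa * semi


def policy_value(lam_w, r_l, kappa, q_ss=0.99882, q_sl=0.90782):
    """Value v(S) of the stationary rule 'Wait with probability lam_w'."""
    lam_t = 1.0 - lam_w

    def rhs(v):
        return sigma([
            (lam_w * q_ss, v - 1.0),          # wait and stay in S
            (lam_w * (1.0 - q_ss), -1.0),     # wait and die
            (lam_t * q_sl, -r_l),             # transplant succeeds: v(L) = -r(L)
            (lam_t * (1.0 - q_sl), 0.0),      # transplant fails: v(D) = 0
        ], kappa)

    lo, hi = -1.0e4, 0.0       # rhs(v) - v is decreasing in v; bracket its root
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if rhs(mid) > mid:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


grid = [i / 2000.0 for i in range(2001)]               # candidate values of lam_w
best_lam = min(grid, key=lambda a: policy_value(a, 515.35, 1.0))
best_val = policy_value(best_lam, 515.35, 1.0)

wait_only = policy_value(1.0, 515.35, 1.0)             # deterministic W
transplant_only = policy_value(0.0, 515.35, 1.0)       # deterministic T
```

Among the two deterministic rules, T gives the lower risk-adjusted cost, while the best grid point lies close to the reported $\lambda_W = 0.9873$ and improves on both. Rerunning `policy_value` with $\kappa = 0$ and r(L) = 610.46 makes W the better deterministic choice, as in the expected value model.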
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